Abstract

Multi-task sparse feature learning aims to improve the generalization performance by ex- 
ploiting the shared features among tasks. It has been successfully applied to many applica- 
tions including computer vision and biomedical informatics. Most of the existing multi-task 
sparse feature learning algorithms are formulated as a convex sparse regularization problem,
which is usually suboptimal, due to its looseness for approximating an $\ell_0$-type regularizer.
In this paper, we propose a non-convex formulation for multi-task sparse feature
learning based on a novel non-convex regularizer. To solve the non-convex optimization 
problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; 
we also provide intuitive interpretations, detailed convergence and reproducibility analysis 
for the proposed algorithm. Moreover, we present a detailed theoretical analysis showing 
that MSMTFL achieves a better parameter estimation error bound than the convex for- 
mulation. Empirical studies on both synthetic and real-world data sets demonstrate the 
effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature 
learning algorithms. 

Keywords: Multi-Task Learning, Multi-Stage, Non-convex, Sparse Learning 



1. Introduction 



Multi-task learning (MTL) (Caruana, 1997) exploits the relationships among multiple related
tasks to improve the generalization performance. It has been successfully applied to many
applications such as speech classification (Parameswaran and Weinberger, 2010), handwritten
character recognition (Obozinski et al., 2006; Quadrianto et al., 2010) and medical diagnosis
(Bi et al., 2008). One common assumption in multi-task learning is that all tasks should share
some common structures, including the prior or parameters of Bayesian models (Schwaighofer
et al., 2005; Yu et al., 2005; Zhang et al., 2006), a similarity metric matrix (Parameswaran
and Weinberger, 2010), a classification weight vector (Evgeniou and Pontil, 2004), a low rank
subspace (Chen et al., 2010; Negahban and Wainwright, 2011) and a common set of shared
features (Argyriou et al., 2008; Gong et al., 2012; Kim and Xing, 2009; Kolar et al., 2011;
Lounici et al., 2009; Liu et al., 2009; Negahban and Wainwright, 2008; Obozinski et al., 2006;
Yang et al., 2009; Zhang et al., 2010).

Multi-task feature learning, which aims to learn a common set of shared features, has
received a lot of interest in machine learning recently, due to the popularity of various
sparse learning formulations and their successful applications in many problems. In this
paper, we focus on a specific multi-task feature learning setting, in which we learn the
features specific to each task as well as the common features shared among tasks. Although
many multi-task feature learning algorithms have been proposed in the past, many of them
require the relevant features to be shared by all tasks. This is too restrictive in real-world
applications (Jalali et al., 2010). To overcome this limitation, Jalali et al. (2010) proposed
an $\ell_1 + \ell_{1,\infty}$ regularized formulation, called dirty model, to leverage the common features
shared among tasks. The dirty model allows a certain feature to be shared by some tasks but
not all tasks. Jalali et al. (2010) also presented a theoretical analysis under the incoherence
condition (Donoho et al., 2006; Obozinski et al., 2011) which is more restrictive than RIP
(Candes and Tao, 2005; Zhang, 2012). The $\ell_1 + \ell_{1,\infty}$ regularizer is a convex relaxation for
the $\ell_0$-type one, in which a globally optimal solution can be obtained. However, a convex
regularizer is known to be too loose to approximate the $\ell_0$-type one and often achieves
suboptimal performance (either requiring restrictive conditions or obtaining a suboptimal
error bound) (Zhang and Zhang, 2012; Zhang, 2010, 2012). To remedy the limitation, a non-convex
regularizer can be used instead. However, the non-convex formulation is usually difficult to solve
and a globally optimal solution cannot be obtained in most practical problems. Moreover,
the solution of the non-convex formulation heavily depends on the specific optimization
algorithm employed. Even with the same optimization algorithm adopted, different
initializations usually lead to different solutions. Thus, it is often challenging to analyze the
theoretical behavior of a non-convex formulation.

Contributions: We propose a non-convex formulation, called capped-$\ell_1,\ell_1$ regularized
model for multi-task feature learning. The proposed model aims to simultaneously learn
the features specific to each task as well as the common features shared among tasks.
We propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm to solve
the non-convex optimization problem. We also provide intuitive interpretations of the
proposed algorithm from several aspects. In addition, we present a detailed convergence
analysis for the proposed algorithm. To address the reproducibility issue of the non-convex
formulation, we show that the solution generated by the MSMTFL algorithm is unique (i.e.,
the solution is reproducible) under a mild condition, which facilitates the theoretical analysis
of the MSMTFL algorithm. Although the MSMTFL algorithm may not obtain a globally
optimal solution, we show that this solution achieves good performance. Specifically, we
present a detailed theoretical analysis on the parameter estimation error bound for the
MSMTFL algorithm. Our analysis shows that, under the sparse eigenvalue condition which
is weaker than the incoherence condition used in Jalali et al. (2010), MSMTFL improves the
error bound during the multi-stage iteration, i.e., the error bound at the current iteration
improves the one at the last iteration. Empirical studies on both synthetic and real-world
data sets demonstrate the effectiveness of the MSMTFL algorithm in comparison with the
state-of-the-art algorithms.

Notations: Scalars and vectors are denoted by lower case letters and bold face lower
case letters, respectively. Matrices and sets are denoted by capital letters and calligraphic
capital letters, respectively. The $\ell_1$ norm, Euclidean norm, $\ell_\infty$ norm and Frobenius norm
are denoted by $\|\cdot\|_1$, $\|\cdot\|$, $\|\cdot\|_\infty$ and $\|\cdot\|_F$, respectively. $|\cdot|$ denotes the absolute value of
a scalar or the number of elements in a set, depending on the context. We define the $\ell_{p,q}$
norm of a matrix $X$ as

$$\|X\|_{p,q} = \left( \sum_i \left( \Big( \sum_j |x_{ij}|^q \Big)^{1/q} \right)^{p} \right)^{1/p}.$$

We define $\mathbb{N}_n$ as $\{1, \cdots, n\}$ and $N(\mu, \sigma^2)$ as the normal distribution with mean $\mu$ and
variance $\sigma^2$. For a $d \times m$ matrix $W$ and sets $\mathcal{I}_i \subseteq \mathbb{N}_d \times \{i\}$, $\mathcal{I} \subseteq \mathbb{N}_d \times \mathbb{N}_m$, we let $\mathbf{w}_{\mathcal{I}_i}$ be the
$d \times 1$ vector with the $j$-th entry being $w_{ji}$, if $(j,i) \in \mathcal{I}_i$, and 0, otherwise. We also let $W_{\mathcal{I}}$
be a $d \times m$ matrix with the $(j,i)$-th entry being $w_{ji}$, if $(j,i) \in \mathcal{I}$, and 0, otherwise.
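
As a small illustration of the $\ell_{p,q}$ norm defined above, the following sketch computes it with NumPy, assuming (as in the definition) that the inner $\ell_q$ norm runs over the entries of each row and the outer $\ell_p$ norm runs over rows. The function name `lpq_norm` and the example matrix are hypothetical and serve only to make the notation concrete.

```python
import numpy as np

def lpq_norm(X, p, q):
    """l_{p,q} norm: l_q norm of every row, then l_p norm of those row norms."""
    X = np.asarray(X, dtype=float)
    row_norms = np.sum(np.abs(X) ** q, axis=1) ** (1.0 / q)
    return np.sum(row_norms ** p) ** (1.0 / p)

# Hypothetical 3 x 2 matrix with one zero row.
X = np.array([[3.0, -4.0],
              [0.0,  0.0],
              [1.0,  2.0]])

norm_21 = lpq_norm(X, 2, 1)  # sqrt(7^2 + 0^2 + 3^2)
norm_11 = lpq_norm(X, 1, 1)  # sum of absolute values of all entries
norm_22 = lpq_norm(X, 2, 2)  # coincides with the Frobenius norm
```

With $p = 2$, $q = 1$ this is the $\ell_{2,1}$ norm used later for the parameter estimation error.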

Organization: In Section 2, we introduce a non-convex formulation and present the
corresponding optimization algorithm. In Section 3, we discuss the convergence and
reproducibility issues of the MSMTFL algorithm. In Section 4, we present a detailed theoretical
analysis on the MSMTFL algorithm, in terms of the parameter estimation error bound.
In Section 5, we provide a sketch of the proof of the presented theoretical results and the
detailed proof is provided in the Appendix. In Section 6, we report the experimental results
and we conclude the paper in Section 7.



2. The Proposed Formulation and the Optimization Algorithm 

In this section, we first present a non-convex formulation for multi-task feature learning. 
Then, we show how to solve the corresponding optimization problem. Finally, we provide 
intuitive interpretations and discussions for the proposed algorithm. 



2.1 A Non-convex Formulation 



Assume we are given $m$ learning tasks associated with training data $\{(X_1, \mathbf{y}_1), \cdots, (X_m, \mathbf{y}_m)\}$,
where $X_i \in \mathbb{R}^{n_i \times d}$ is the data matrix of the $i$-th task with each row as a sample; $\mathbf{y}_i \in \mathbb{R}^{n_i}$ is
the response of the $i$-th task; $d$ is the data dimensionality; $n_i$ is the number of samples for
the $i$-th task. We consider learning a weight matrix $W = [\mathbf{w}_1, \cdots, \mathbf{w}_m] \in \mathbb{R}^{d \times m}$ consisting
of the weight vectors for $m$ linear predictive models: $\mathbf{y}_i \approx f_i(X_i) = X_i \mathbf{w}_i$, $i \in \mathbb{N}_m$. In
this paper, we propose a non-convex multi-task feature learning formulation to learn these
$m$ models simultaneously, based on the capped-$\ell_1,\ell_1$ regularization. Specifically, we first
impose the $\ell_1$ penalty on each row of $W$, obtaining a column vector. Then, we impose the
capped-$\ell_1$ penalty (Zhang, 2010, 2012) on that vector. Formally, we formulate our proposed
model as follows:



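$$\min_{W} \left\{ l(W) + \lambda \sum_{j=1}^{d} \min\left( \|\mathbf{w}^j\|_1, \theta \right) \right\}, \tag{1}$$

where $l(W)$ is an empirical loss function of $W$; $\lambda$ ($> 0$) is a parameter balancing the empirical
loss and the regularization; $\theta$ ($> 0$) is a thresholding parameter; $\mathbf{w}^j$ is the $j$-th row of the
matrix $W$. In this paper, we focus on the following quadratic loss function:

$$l(W) = \sum_{i=1}^{m} \frac{1}{m n_i} \|X_i \mathbf{w}_i - \mathbf{y}_i\|^2. \tag{2}$$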



Intuitively, due to the capped-$\ell_1,\ell_1$ penalty, the optimal solution of Eq. (1), denoted as
$W^\star$, has many zero rows. For a nonzero row $(\mathbf{w}^\star)^k$, some entries may be zero, due to the
$\ell_1$-norm imposed on each row of $W$. Thus, under the formulation in Eq. (1), some features
can be shared by some tasks but not all the tasks. Therefore, the proposed formulation can
leverage the common features shared among tasks.
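
To make the formulation concrete, the following minimal sketch evaluates the objective of Eq. (1) with the quadratic loss of Eq. (2). The function names and the toy data are hypothetical and chosen only for illustration; tasks are passed as Python lists so that each task may have its own sample size $n_i$.

```python
import numpy as np

def quadratic_loss(W, Xs, ys):
    """Eq. (2): sum over tasks of ||X_i w_i - y_i||^2 / (m * n_i)."""
    m = len(Xs)
    return sum(np.sum((X @ W[:, i] - y) ** 2) / (m * X.shape[0])
               for i, (X, y) in enumerate(zip(Xs, ys)))

def capped_l1_l1(W, theta):
    """Sum over rows of min(l1 norm of the row, theta)."""
    return np.sum(np.minimum(np.sum(np.abs(W), axis=1), theta))

def objective(W, Xs, ys, lam, theta):
    """Eq. (1): empirical loss plus lambda times the capped-l1,l1 penalty."""
    return quadratic_loss(W, Xs, ys) + lam * capped_l1_l1(W, theta)

# Toy example (hypothetical values): d = 3 features, m = 2 tasks.
W = np.array([[0.0,  0.0],    # zero row: contributes 0 to the penalty
              [0.1,  0.0],    # small row: contributes its l1 norm, 0.1
              [3.0, -2.0]])   # large row: contribution is capped at theta
Xs = [np.eye(3), np.eye(3)]
ys = [W[:, 0].copy(), np.zeros(3)]

penalty = capped_l1_l1(W, theta=1.0)
value = objective(W, Xs, ys, lam=0.5, theta=1.0)
```

The third row illustrates the role of the cap: once the $\ell_1$ norm of a row exceeds $\theta$, making the row larger costs nothing more, which is what removes the bias on large rows.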

2.2 Optimization Algorithm 

The formulation in Eq. (1) is non-convex and is difficult to solve. In this paper, we
propose an algorithm called Multi-Stage Multi-Task Feature Learning (MSMTFL) to solve the
optimization problem (see details in Algorithm 1). In this algorithm, a key step is how to
efficiently solve Eq. (3). Observing that the objective function in Eq. (3) can be decomposed
into the sum of a differentiable loss function and a non-differentiable regularization term, we
employ FISTA (Beck and Teboulle, 2009) to solve the sub-problem. In the following, we
present some intuitive interpretations of the proposed algorithm from several aspects.

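Algorithm 1: MSMTFL: Multi-Stage Multi-Task Feature Learning

1 Initialize $\lambda_j^{(0)} = \lambda$;
2 for $\ell = 1, 2, \cdots$ do
3   Let $\hat{W}^{(\ell)}$ be a solution of the following problem:

    $$\min_{W \in \mathbb{R}^{d \times m}} \left\{ l(W) + \sum_{j=1}^{d} \lambda_j^{(\ell-1)} \|\mathbf{w}^j\|_1 \right\}. \tag{3}$$

4   Let $\lambda_j^{(\ell)} = \lambda I(\|(\hat{\mathbf{w}}^{(\ell)})^j\|_1 < \theta)$ ($j = 1, \cdots, d$), where $(\hat{\mathbf{w}}^{(\ell)})^j$ is the $j$-th row of $\hat{W}^{(\ell)}$
    and $I(\cdot)$ denotes the $\{0, 1\}$-valued indicator function.
5 end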
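
The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sub-problem in Eq. (3) is solved by a plain FISTA loop with a fixed step size and a fixed iteration count, the algorithm is run for a fixed number of stages, and all function names, the synthetic noise-free data and the values of `lam` and `theta` are hypothetical.

```python
import numpy as np

def soft_threshold(Z, T):
    """Entrywise proximal operator of the weighted l1 norm."""
    return np.sign(Z) * np.maximum(np.abs(Z) - T, 0.0)

def loss_gradient(W, Xs, ys):
    """Gradient of the quadratic loss in Eq. (2), one column per task."""
    m = len(Xs)
    G = np.zeros_like(W)
    for i, (X, y) in enumerate(zip(Xs, ys)):
        G[:, i] = 2.0 * X.T @ (X @ W[:, i] - y) / (m * X.shape[0])
    return G

def solve_stage(Xs, ys, lam_rows, W0, n_iter=500):
    """FISTA for Eq. (3): loss plus sum_j lam_rows[j] * ||w^j||_1."""
    m = len(Xs)
    # Lipschitz constant of the gradient (the loss is separable over tasks).
    L = max(2.0 * np.linalg.norm(X, 2) ** 2 / (m * X.shape[0]) for X in Xs)
    W, V, t = W0.copy(), W0.copy(), 1.0
    for _ in range(n_iter):
        W_new = soft_threshold(V - loss_gradient(V, Xs, ys) / L,
                               lam_rows[:, None] / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        V = W_new + ((t - 1.0) / t_new) * (W_new - W)
        W, t = W_new, t_new
    return W

def msmtfl(Xs, ys, lam, theta, n_stages=5):
    """Return the list of stage solutions W^(1), ..., W^(n_stages)."""
    d = Xs[0].shape[1]
    W = np.zeros((d, len(Xs)))
    lam_rows = np.full(d, lam)                                 # Step 1
    stages = []
    for _ in range(n_stages):                                  # Step 2
        W = solve_stage(Xs, ys, lam_rows, W)                   # Step 3
        lam_rows = lam * (np.sum(np.abs(W), axis=1) < theta)   # Step 4
        stages.append(W)
    return stages

# Hypothetical noise-free example: 3 relevant rows, partly shared across tasks.
rng = np.random.default_rng(0)
n, d, m = 50, 20, 3
W_true = np.zeros((d, m))
W_true[:3] = [[1.5, -2.0, 1.0], [-1.0, 1.2, 0.0], [2.0, 0.0, -1.5]]
Xs = [rng.standard_normal((n, d)) for _ in range(m)]
ys = [X @ W_true[:, i] for i, X in enumerate(Xs)]

stages = msmtfl(Xs, ys, lam=0.1, theta=0.5)
err_lasso = np.max(np.abs(stages[0] - W_true))   # stage 1 is the Lasso
err_final = np.max(np.abs(stages[-1] - W_true))
```

In this sketch the first stage carries the usual Lasso shrinkage bias on the large rows, while later stages stop penalizing rows whose $\ell_1$ norm reached $\theta$, so the final error is smaller than the first-stage error.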



2.2.1 Locally Linear Approximation 
First, we define two auxiliary functions: 

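$$h : \mathbb{R}^{d \times m} \mapsto \mathbb{R}^d_+, \quad h(W) = \left[ \|\mathbf{w}^1\|_1, \cdots, \|\mathbf{w}^d\|_1 \right]^T,$$

$$g : \mathbb{R}^d_+ \mapsto \mathbb{R}_+, \quad g(\mathbf{u}) = \sum_{j=1}^{d} \min(u_j, \theta).$$

We note that $g(\cdot)$ is a concave function and we say that a vector $\mathbf{s} \in \mathbb{R}^d$ is a sub-gradient
of $g$ at $\mathbf{v} \in \mathbb{R}^d_+$, if for all vectors $\mathbf{u} \in \mathbb{R}^d_+$, the following inequality holds:

$$g(\mathbf{u}) \le g(\mathbf{v}) + \langle \mathbf{s}, \mathbf{u} - \mathbf{v} \rangle,$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product. Using the functions defined above, Eq. (1) can be
equivalently rewritten as follows:

$$\min_{W} \left\{ l(W) + \lambda g(h(W)) \right\}. \tag{4}$$

Based on the definition of the sub-gradient for a concave function given above, we can
obtain an upper bound of $g(h(W))$ using a locally linear approximation at $h(\hat{W}^{(\ell)})$:

$$g(h(W)) \le g(h(\hat{W}^{(\ell)})) + \left\langle \mathbf{s}^{(\ell)}, h(W) - h(\hat{W}^{(\ell)}) \right\rangle,$$

where $\mathbf{s}^{(\ell)}$ is a sub-gradient of $g(\mathbf{u})$ at $\mathbf{u} = h(\hat{W}^{(\ell)})$. Furthermore, we can obtain an upper
bound of the objective function in Eq. (4), if the solution $\hat{W}^{(\ell)}$ at the $\ell$-th iteration is
available:

$$\forall W \in \mathbb{R}^{d \times m}: \; l(W) + \lambda g(h(W)) \le l(W) + \lambda g(h(\hat{W}^{(\ell)})) + \lambda \left\langle \mathbf{s}^{(\ell)}, h(W) - h(\hat{W}^{(\ell)}) \right\rangle. \tag{5}$$

It can be shown that a sub-gradient of $g(\mathbf{u})$ at $\mathbf{u} = h(\hat{W}^{(\ell)})$ is

$$\mathbf{s}^{(\ell)} = \left[ I(\|(\hat{\mathbf{w}}^{(\ell)})^1\|_1 < \theta), \cdots, I(\|(\hat{\mathbf{w}}^{(\ell)})^d\|_1 < \theta) \right]^T, \tag{6}$$

which is used in Step 4 of Algorithm 1. Since both $\lambda$ and $h(\hat{W}^{(\ell)})$ are constant with respect
to $W$, we have

$$\begin{aligned}
\hat{W}^{(\ell+1)} &= \arg\min_{W} \left\{ l(W) + \lambda g(h(\hat{W}^{(\ell)})) + \lambda \left\langle \mathbf{s}^{(\ell)}, h(W) - h(\hat{W}^{(\ell)}) \right\rangle \right\} \\
&= \arg\min_{W} \left\{ l(W) + \lambda (\mathbf{s}^{(\ell)})^T h(W) \right\},
\end{aligned}$$

which, as shown in Step 3 of Algorithm 1, obtains the next iterative solution by minimizing
the upper bound of the objective function in Eq. (4). Thus, in the viewpoint of the locally
linear approximation, we can understand Algorithm 1 as follows: The original formulation
in Eq. (4) is non-convex and is difficult to solve; the proposed algorithm minimizes an upper
bound in each step, which is convex and can be solved efficiently. It is closely related to the
Concave Convex Procedure (CCCP) (Yuille and Rangarajan, 2003). In addition, we can
easily verify that the objective function value decreases monotonically as follows:

$$\begin{aligned}
l(\hat{W}^{(\ell+1)}) + \lambda g(h(\hat{W}^{(\ell+1)})) &\le l(\hat{W}^{(\ell+1)}) + \lambda g(h(\hat{W}^{(\ell)})) + \lambda \left\langle \mathbf{s}^{(\ell)}, h(\hat{W}^{(\ell+1)}) - h(\hat{W}^{(\ell)}) \right\rangle \\
&\le l(\hat{W}^{(\ell)}) + \lambda g(h(\hat{W}^{(\ell)})) + \lambda \left\langle \mathbf{s}^{(\ell)}, h(\hat{W}^{(\ell)}) - h(\hat{W}^{(\ell)}) \right\rangle \\
&= l(\hat{W}^{(\ell)}) + \lambda g(h(\hat{W}^{(\ell)})),
\end{aligned}$$

where the first inequality is due to Eq. (5) and the second inequality follows from the fact
that $\hat{W}^{(\ell+1)}$ is a minimizer of the right hand side of Eq. (5).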
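
The majorization underlying this argument can be checked numerically. The sketch below draws random points in $\mathbb{R}^d_+$ and verifies that the linear approximation built from the indicator sub-gradient of Eq. (6) never falls below $g$; the helper names, the value of `theta` and the sampling ranges are hypothetical.

```python
import numpy as np

def g(u, theta):
    """g(u) = sum_j min(u_j, theta)."""
    return np.sum(np.minimum(u, theta))

def subgradient_g(v, theta):
    """Indicator sub-gradient of the concave g at v, as in Eq. (6)."""
    return (v < theta).astype(float)

rng = np.random.default_rng(0)
theta = 1.0
gaps = []
for _ in range(1000):
    u = rng.uniform(0.0, 3.0, size=5)   # point where the bound is evaluated
    v = rng.uniform(0.0, 3.0, size=5)   # point of linearization
    s = subgradient_g(v, theta)
    # Upper bound minus g(u); concavity makes this non-negative.
    gaps.append(g(v, theta) + s @ (u - v) - g(u, theta))

min_gap = min(gaps)
```

Coordinate-wise, the gap equals $u_j - \min(u_j, \theta)$ when $v_j < \theta$ and $\theta - \min(u_j, \theta)$ otherwise, so it is non-negative up to floating-point rounding.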

An important issue we should mention is that a monotonic decrease of the objective
function value does not guarantee the convergence of the algorithm, even if the objective
function is strictly convex and continuously differentiable (see an example in the book
(Bertsekas, 1999, Fig 1.2.6)). In Section 3.1, we will formally discuss the convergence issue.



2.2.2 Block Coordinate Descent 



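Recall that $g(\mathbf{u})$ is a concave function. We can define its conjugate function as (Rockafellar, 1970):

$$g^\star(\mathbf{v}) = \inf_{\mathbf{u}} \left\{ \mathbf{v}^T \mathbf{u} - g(\mathbf{u}) \right\}.$$

Since $g(\mathbf{u})$ is also a closed function (i.e., the hypograph of $g(\mathbf{u})$ is closed), the conjugate
function of $g^\star(\mathbf{v})$ is the original function $g(\mathbf{u})$ (Bertsekas, 1999, Chap. 5.4), that is:

$$g(\mathbf{u}) = \inf_{\mathbf{v}} \left\{ \mathbf{u}^T \mathbf{v} - g^\star(\mathbf{v}) \right\}. \tag{7}$$

Substituting Eq. (7) with $\mathbf{u} = h(W)$ into Eq. (4), we can reformulate Eq. (4) as:

$$\min_{W, \mathbf{v}} \left\{ f(W, \mathbf{v}) = l(W) + \lambda \mathbf{v}^T h(W) - \lambda g^\star(\mathbf{v}) \right\}. \tag{8}$$

A straightforward algorithm for optimizing Eq. (8) is the block coordinate descent (Grippo and
Sciandrone, 2000; Tseng, 2001) summarized below:

• Fix $W = \hat{W}^{(\ell)}$:

$$\begin{aligned}
\hat{\mathbf{v}}^{(\ell)} &= \arg\min_{\mathbf{v}} \left\{ l(\hat{W}^{(\ell)}) + \lambda \mathbf{v}^T h(\hat{W}^{(\ell)}) - \lambda g^\star(\mathbf{v}) \right\} \\
&= \arg\min_{\mathbf{v}} \left\{ \mathbf{v}^T h(\hat{W}^{(\ell)}) - g^\star(\mathbf{v}) \right\}.
\end{aligned} \tag{9}$$

  Based on Eq. (7) and the Danskin's Theorem (Bertsekas, 1999, Proposition B.25),
  one solution of Eq. (9) is given by a sub-gradient of $g(\mathbf{u})$ at $\mathbf{u} = h(\hat{W}^{(\ell)})$. That is, we
  can choose $\hat{\mathbf{v}}^{(\ell)} = \mathbf{s}^{(\ell)}$ given in Eq. (6). Apparently, Eq. (9) is equivalent to Step 4 in
  Algorithm 1.

• Fix $\mathbf{v} = \hat{\mathbf{v}}^{(\ell)} = \left[ I(\|(\hat{\mathbf{w}}^{(\ell)})^1\|_1 < \theta), \cdots, I(\|(\hat{\mathbf{w}}^{(\ell)})^d\|_1 < \theta) \right]^T$:

$$\begin{aligned}
\hat{W}^{(\ell+1)} &= \arg\min_{W} \left\{ l(W) + \lambda (\hat{\mathbf{v}}^{(\ell)})^T h(W) - \lambda g^\star(\hat{\mathbf{v}}^{(\ell)}) \right\} \\
&= \arg\min_{W} \left\{ l(W) + \lambda (\hat{\mathbf{v}}^{(\ell)})^T h(W) \right\},
\end{aligned} \tag{10}$$

  which corresponds to Step 3 of Algorithm 1.

The block coordinate descent procedure is intuitive; however, it is non-trivial to analyze its
convergence behavior. We will present the convergence analysis in Section 3.1.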
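
The duality in Eq. (7) can be illustrated numerically. The sketch below relies on a closed form for $g^\star$ that is not stated in the text and is derived here only for illustration: minimizing $v_j u_j - \min(u_j, \theta)$ over $u_j \ge 0$ coordinate by coordinate gives $g^\star(\mathbf{v}) = \theta \sum_j \min(v_j - 1, 0)$ for $\mathbf{v} \ge 0$. Under that assumption, the indicator vector of Eq. (6) attains the infimum in Eq. (7). All names and values are hypothetical.

```python
import numpy as np

def g(u, theta):
    """g(u) = sum_j min(u_j, theta)."""
    return np.sum(np.minimum(u, theta))

def g_conj(v, theta):
    """Concave conjugate of g for v >= 0 (closed form derived for this sketch)."""
    return theta * np.sum(np.minimum(v - 1.0, 0.0))

def best_v(u, theta):
    """Candidate minimizer of u^T v - g*(v): the indicator vector of Eq. (6)."""
    return (u < theta).astype(float)

theta = 1.0
u = np.array([0.0, 0.4, 1.0, 2.5])

v_star = best_v(u, theta)
recovered = u @ v_star - g_conj(v_star, theta)   # value of Eq. (7) at v_star
direct = g(u, theta)                             # g(u) evaluated directly

# No other v in [0, 1]^d should give a smaller value than g(u).
rng = np.random.default_rng(0)
trial_values = [u @ v - g_conj(v, theta)
                for v in rng.uniform(0.0, 1.0, size=(1000, u.size))]
smallest_trial = min(trial_values)
```

This mirrors the $\mathbf{v}$-update in Eq. (9): with $W$ fixed, the minimizing $\mathbf{v}$ is the indicator of the rows whose $\ell_1$ norm is below $\theta$.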



2.2.3 Discussions 

If we terminate the algorithm with $\ell = 1$, the MSMTFL algorithm is equivalent to the $\ell_1$
regularized multi-task feature learning algorithm (Lasso). Thus, the solution obtained by
MSMTFL can be considered as a multi-stage refinement of that of Lasso. Basically, the
MSMTFL algorithm solves a sequence of weighted Lasso problems, where the weights $\lambda_j$'s
are set as the product of the parameter $\lambda$ in Eq. (1) and a $\{0, 1\}$-valued indicator function.
Specifically, a penalty is imposed in the current stage if the $\ell_1$-norm of some row of $W$ in
the last stage is smaller than the threshold $\theta$; otherwise, no penalty is imposed. In other
words, MSMTFL in the current stage tends to shrink the small rows of $W$ and keep the
large rows of $W$ in the last stage. However, Lasso (corresponds to $\ell = 1$) penalizes all
rows of $W$ in the same way. It may incorrectly keep the irrelevant rows (which should
have been zero rows) or shrink the relevant rows (which should have been large rows) to be
zero vectors. MSMTFL overcomes this limitation by adaptively penalizing the rows of $W$
according to the solution generated in the last stage. One important question is whether
the MSMTFL algorithm can improve the performance during the multi-stage iteration.
In Section 4, we will theoretically show that the MSMTFL algorithm indeed achieves the
stagewise improvement in terms of the parameter estimation error bound. That is, the
error bound in the current stage improves the one in the last stage. Empirical studies in
Section 6 also validate the presented theoretical analysis.



3. Convergence and Reproducibility Analysis 

In this section, we first present the convergence analysis. Then, we discuss the reproducibil- 
ity issue for the MSMTFL algorithm. 

3.1 Convergence Analysis 

The main convergence result is summarized in the following theorem, which is based on the 
block coordinate descent interpretation. 

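Theorem 1 Let $(W^\star, \mathbf{v}^\star)$ be a limit point of the sequence $\{\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)}\}$ generated by the
block coordinate descent algorithm. Then $W^\star$ is a critical point of Eq. (1).

Proof Based on Eq. (9) and Eq. (10), we have

$$\begin{aligned}
f(\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)}) &\le f(\hat{W}^{(\ell)}, \mathbf{v}), \quad \forall \mathbf{v} \in \mathbb{R}^d, \\
f(\hat{W}^{(\ell+1)}, \hat{\mathbf{v}}^{(\ell)}) &\le f(W, \hat{\mathbf{v}}^{(\ell)}), \quad \forall W \in \mathbb{R}^{d \times m}.
\end{aligned} \tag{11}$$

It follows that

$$f(\hat{W}^{(\ell+1)}, \hat{\mathbf{v}}^{(\ell+1)}) \le f(\hat{W}^{(\ell+1)}, \hat{\mathbf{v}}^{(\ell)}) \le f(\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)}),$$

which indicates that the sequence $\{f(\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)})\}$ is monotonically decreasing. Since $(W^\star, \mathbf{v}^\star)$
is a limit point of $\{\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)}\}$, there exists a subsequence $\mathcal{K}$ such that

$$\lim_{\ell \in \mathcal{K} \to \infty} (\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)}) = (W^\star, \mathbf{v}^\star).$$

We observe that

$$f(W, \mathbf{v}) = l(W) + \lambda \mathbf{v}^T h(W) - \lambda g^\star(\mathbf{v}) \ge l(W) + \lambda g(h(W)) \ge 0,$$

where the first inequality above is due to Eq. (7). Thus, $\{f(\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)})\}_{\ell \in \mathcal{K}}$ is bounded below.
Together with the fact that $\{f(\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)})\}$ is decreasing, $\lim_{\ell \to \infty} f(\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)}) > -\infty$ exists.
Since $f(W, \mathbf{v})$ is continuous, we have

$$\lim_{\ell \to \infty} f(\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)}) = \lim_{\ell \in \mathcal{K} \to \infty} f(\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)}) = f(W^\star, \mathbf{v}^\star).$$

Taking limits on both sides of Eq. (11) with $\ell \in \mathcal{K} \to \infty$, we have

$$f(W^\star, \mathbf{v}^\star) \le f(W, \mathbf{v}^\star), \quad \forall W \in \mathbb{R}^{d \times m},$$

which implies

$$\begin{aligned}
W^\star &\in \arg\min_{W} f(W, \mathbf{v}^\star) \\
&= \arg\min_{W} \left\{ l(W) + \lambda (\mathbf{v}^\star)^T h(W) - \lambda g^\star(\mathbf{v}^\star) \right\} \\
&= \arg\min_{W} \left\{ l(W) + \lambda (\mathbf{v}^\star)^T h(W) \right\}.
\end{aligned} \tag{12}$$

Therefore, the zero matrix $O$ must be a sub-gradient of the objective function in Eq. (12)
at $W = W^\star$:

$$O \in \partial l(W^\star) + \lambda \partial \left( (\mathbf{v}^\star)^T h(W^\star) \right) = \partial l(W^\star) + \lambda \sum_{j=1}^{d} v_j^\star \, \partial \left( \|(\mathbf{w}^\star)^j\|_1 \right), \tag{13}$$

where $\partial l(W^\star)$ denotes the sub-differential (which is a set composed of all sub-gradients) of
$l(W)$ at $W = W^\star$. We observe that

$$\hat{\mathbf{v}}^{(\ell)} \in \partial g(\mathbf{u}) \big|_{\mathbf{u} = h(\hat{W}^{(\ell)})},$$

which implies that $\forall \mathbf{x} \in \mathbb{R}^d_+$:

$$g(\mathbf{x}) \le g(h(\hat{W}^{(\ell)})) + \left\langle \hat{\mathbf{v}}^{(\ell)}, \mathbf{x} - h(\hat{W}^{(\ell)}) \right\rangle.$$

Taking limits on both sides of the above inequality with $\ell \in \mathcal{K} \to \infty$, we have:

$$g(\mathbf{x}) \le g(h(W^\star)) + \left\langle \mathbf{v}^\star, \mathbf{x} - h(W^\star) \right\rangle,$$

which implies that $\mathbf{v}^\star$ is a sub-gradient of $g(\mathbf{u})$ at $\mathbf{u} = h(W^\star)$, that is:

$$\mathbf{v}^\star \in \partial g(\mathbf{u}) \big|_{\mathbf{u} = h(W^\star)}. \tag{14}$$

Substituting Eq. (14) into Eq. (13), we obtain:

$$O \in \partial l(W^\star) + \lambda \sum_{j=1}^{d} \partial \min\left( \|(\mathbf{w}^\star)^j\|_1, \theta \right).$$

Therefore, $W^\star$ is a critical point of Eq. (1). This completes the proof of Theorem 1. ■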

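Remark 2 Note that the above theorem holds by assuming that there exists a limit point.
Next, we need to prove that the sequence $\{\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)}\}$ has a limit point. For any bounded
initial point $(\hat{W}^{(0)}, \hat{\mathbf{v}}^{(0)})$, based on Eq. (7), Eq. (8) and the monotonicity of $f(\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)})$,
we have:

$$l(\hat{W}^{(\ell)}) + \lambda g(h(\hat{W}^{(\ell)})) \le f(\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)}) \le f(\hat{W}^{(0)}, \hat{\mathbf{v}}^{(0)}) < +\infty, \quad \forall \ell \ge 1. \tag{15}$$

Assume that the sequence $\{\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)}\}$ is unbounded, that is, there exist some $i, j$ such that
$|\hat{w}_{ji}^{(\ell)}| \to +\infty$. It implies that $l(\hat{W}^{(\ell)}) + \lambda g(h(\hat{W}^{(\ell)})) \to +\infty$ (We exclude the case that
some columns of $X_i$ are zero vectors. Otherwise, we can simply remove the corresponding
zero columns.) and hence $f(\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)}) \to +\infty$. This leads to a contradiction with Eq. (15).
Thus, the sequence $\{\hat{W}^{(\ell)}, \hat{\mathbf{v}}^{(\ell)}\}$ is bounded and there exists at least one limit point $(W^\star, \mathbf{v}^\star)$,
since any bounded sequence has limit points.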






Due to the equivalence between Algorithm 1 and the block coordinate descent algorithm
above, Theorem 1 and its remark indicate that the sequence $\{\hat{W}^{(\ell)}\}$ generated by
Algorithm 1 has at least one limit point that is also a critical point of Eq. (1). The remaining
issue is to analyze the performance of the critical point. In the sequel, we will conduct
analysis in two aspects: reproducibility and the parameter estimation performance.

3.2 Reproducibility of The Algorithm 

In general, it is difficult to analyze the performance of a non-convex formulation, as different
solutions can be obtained due to different initializations. One natural question is whether
the solution generated by Algorithm 1 (based on the initialization of $\lambda_j^{(0)} = \lambda$ in Step 1) is
reproducible. In other words, is the solution of Algorithm 1 unique? If we can guarantee
that, for any $\ell \ge 1$, the solution $\hat{W}^{(\ell)}$ of Eq. (3) is unique, then the solution generated by
Algorithm 1 is unique. That is, the solution is reproducible. The main result is summarized
in the following theorem:

Theorem 3 If $X_i \in \mathbb{R}^{n_i \times d}$ ($i \in \mathbb{N}_m$) has entries drawn from a continuous probability
distribution on $\mathbb{R}^{n_i d}$, then, for any $\ell \ge 1$, the optimization problem in Eq. (3) has a unique
solution with probability one.

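Proof Eq. (3) can be decomposed into $m$ independent smaller minimization problems:

$$\hat{\mathbf{w}}_i^{(\ell)} = \arg\min_{\mathbf{w}_i} \frac{1}{m n_i} \|X_i \mathbf{w}_i - \mathbf{y}_i\|^2 + \sum_{j=1}^{d} \lambda_j^{(\ell-1)} |w_{ji}|.$$

Next, we only need to prove the solution of the above optimization problem is unique. To
simplify the notations, we unclutter the above equation (by ignoring some superscripts and
subscripts) as follows:

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \frac{1}{mn} \|X \mathbf{w} - \mathbf{y}\|^2 + \sum_{j=1}^{d} \lambda_j |w_j|. \tag{16}$$

The first order optimal condition is $\forall j \in \mathbb{N}_d$:

$$\frac{2}{mn} \mathbf{x}_j^T (\mathbf{y} - X \hat{\mathbf{w}}) = \lambda_j \, \mathrm{sign}(\hat{w}_j), \tag{17}$$

where $\mathrm{sign}(\hat{w}_j) = 1$, if $\hat{w}_j > 0$; $\mathrm{sign}(\hat{w}_j) = -1$, if $\hat{w}_j < 0$; and $\mathrm{sign}(\hat{w}_j) \in [-1, 1]$, otherwise.
We define

$$\mathcal{E} = \left\{ j \in \mathbb{N}_d : \frac{2}{mn} \left| \mathbf{x}_j^T (\mathbf{y} - X \hat{\mathbf{w}}) \right| = \lambda_j \right\}, \qquad
\mathbf{s} = \mathrm{sign}\left( \frac{2}{mn} X_{\mathcal{E}}^T (\mathbf{y} - X \hat{\mathbf{w}}) \right),$$

where $X_{\mathcal{E}}$ denotes the matrix composed of the columns of $X$ indexed by $\mathcal{E}$. Then, the
optimal solution $\hat{\mathbf{w}}$ of Eq. (16) satisfies

$$\begin{aligned}
\hat{\mathbf{w}}_{\mathbb{N}_d \setminus \mathcal{E}} &= \mathbf{0}, \\
\hat{\mathbf{w}}_{\mathcal{E}} &= \arg\min_{\mathbf{w}_{\mathcal{E}}} \frac{1}{mn} \|X_{\mathcal{E}} \mathbf{w}_{\mathcal{E}} - \mathbf{y}\|^2 + \sum_{j \in \mathcal{E}} \lambda_j |w_j|, \quad \text{s.t. } s_j w_j \ge 0, \; j \in \mathcal{E},
\end{aligned} \tag{18}$$

where $\hat{\mathbf{w}}_{\mathcal{E}}$ denotes the vector composed of entries of $\hat{\mathbf{w}}$ indexed by $\mathcal{E}$. Since $X \in \mathbb{R}^{n \times d}$
is drawn from the continuous probability distribution, $X$ has columns in general positions
with probability one and hence $\mathrm{rank}(X_{\mathcal{E}}) = |\mathcal{E}|$ (or equivalently $\mathrm{Null}(X_{\mathcal{E}}) = \{\mathbf{0}\}$), due to
Lemma 3, Lemma 4 and their discussions in Tibshirani (2012). Therefore, the objective
function in Eq. (18) is strictly convex, which implies that $\hat{\mathbf{w}}_{\mathcal{E}}$ is unique. Thus, the optimal
solution $\hat{\mathbf{w}}$ of Eq. (16) is also unique and so is the solution of the optimization problem in
Eq. (3) for any $\ell \ge 1$. This completes the proof of Theorem 3. ■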

Theorem 3 is important in the sense that it makes the theoretical analysis for the parameter
estimation performance of Algorithm 1 possible. Although the solution may not be globally
optimal, we show in the next section that the solution has good performance in terms of
the parameter estimation error bound.

4. Parameter Estimation Error Bound 

In this section, we theoretically analyze the parameter estimation performance of the so- 
lution obtained by the MSMTFL algorithm. To simplify the notations in the theoretical 
analysis, we assume that the number of samples for all the tasks are the same. However, 
our theoretical analysis can be easily extended to the case where the tasks have different 
sample sizes. 

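We first present a sub-Gaussian noise assumption which is very common in the analysis
of sparse learning literature (Zhang and Zhang, 2012; Zhang, 2008, 2009, 2010, 2012).

Assumption 1 Let $\bar{W} = [\bar{\mathbf{w}}_1, \cdots, \bar{\mathbf{w}}_m] \in \mathbb{R}^{d \times m}$ be the underlying sparse weight matrix
and $\mathbf{y}_i = X_i \bar{\mathbf{w}}_i + \boldsymbol{\delta}_i$, $\mathbb{E} \mathbf{y}_i = X_i \bar{\mathbf{w}}_i$, where $\boldsymbol{\delta}_i \in \mathbb{R}^n$ is a random vector with all entries
$\delta_{ji}$ ($j \in \mathbb{N}_n$, $i \in \mathbb{N}_m$) being independent sub-Gaussians: there exists $\sigma > 0$ such that
$\forall j \in \mathbb{N}_n$, $i \in \mathbb{N}_m$, $t \in \mathbb{R}$:

$$\mathbb{E}_{\delta_{ji}} \exp(t \delta_{ji}) \le \exp\left( \frac{\sigma^2 t^2}{2} \right).$$

Remark 4 We call the random variable satisfying the condition in Assumption 1 sub-Gaussian,
since its moment generating function is bounded by that of a zero mean Gaussian
random variable. That is, if a normal random variable $x \sim N(0, \sigma^2)$, then we have:

$$\begin{aligned}
\mathbb{E} \exp(tx) &= \int_{-\infty}^{\infty} \exp(tx) \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{x^2}{2\sigma^2} \right) dx \\
&= \exp(\sigma^2 t^2 / 2) \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(x - \sigma^2 t)^2}{2\sigma^2} \right) dx \\
&= \exp(\sigma^2 t^2 / 2).
\end{aligned}$$

Remark 5 Based on the Hoeffding's Lemma, for any random variable $x \in [a, b]$ and $\mathbb{E} x = 0$,
we have $\mathbb{E}(\exp(tx)) \le \exp\left( \frac{t^2 (b-a)^2}{8} \right)$. Therefore, both zero mean Gaussian and zero mean
bounded random variables are sub-Gaussians. Thus, the sub-Gaussian noise assumption is
more general than the Gaussian noise assumption which is commonly used in the multi-task
learning literature (Jalali et al., 2010; Lounici et al., 2009).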
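
As a quick numerical illustration of Remark 5 (a sketch with a hypothetical choice of distribution, not part of the analysis), consider a Rademacher variable taking the values $\pm 1$ with probability $1/2$ each. It is zero mean and bounded in $[a, b] = [-1, 1]$, its moment generating function is $\cosh(t)$, and the Hoeffding bound becomes $\exp(t^2 (b-a)^2 / 8) = \exp(t^2 / 2)$, i.e., the sub-Gaussian condition with $\sigma = 1$.

```python
import numpy as np

a, b = -1.0, 1.0                      # support of the bounded variable
t = np.linspace(-5.0, 5.0, 201)       # grid of arguments for the MGF

mgf_rademacher = np.cosh(t)           # E exp(t x) for x = +1 or -1 w.p. 1/2
hoeffding_bound = np.exp(t ** 2 * (b - a) ** 2 / 8.0)

# Sub-Gaussian condition of Assumption 1 with sigma = (b - a) / 2 = 1.
sigma = (b - a) / 2.0
subgaussian_bound = np.exp(sigma ** 2 * t ** 2 / 2.0)

bound_holds = bool(np.all(mgf_rademacher <= hoeffding_bound * (1.0 + 1e-12)))
```

The two bounds coincide in this case, which shows how a bounded zero mean variable fits Assumption 1 with $\sigma = (b - a)/2$.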
We next introduce the following sparse eigenvalue concept which is also common in
the analysis of sparse learning literature (Zhang and Huang, 2008; Zhang and Zhang, 2012;
Zhang, 2009, 2010, 2012).

Definition 6 Given $1 \le k \le d$, we define

$$\rho_i^+(k) = \sup_{\mathbf{w}} \left\{ \frac{\|X_i \mathbf{w}\|^2}{n \|\mathbf{w}\|^2} : \|\mathbf{w}\|_0 \le k \right\}, \qquad \rho_{\max}^+(k) = \max_{i \in \mathbb{N}_m} \rho_i^+(k),$$

$$\rho_i^-(k) = \inf_{\mathbf{w}} \left\{ \frac{\|X_i \mathbf{w}\|^2}{n \|\mathbf{w}\|^2} : \|\mathbf{w}\|_0 \le k \right\}, \qquad \rho_{\min}^-(k) = \min_{i \in \mathbb{N}_m} \rho_i^-(k).$$

Remark 7 $\rho_i^+(k)$ ($\rho_i^-(k)$) is in fact the maximum (minimum) eigenvalue of $(X_i)_{\mathcal{S}}^T (X_i)_{\mathcal{S}} / n$,
where $\mathcal{S}$ is a set satisfying $|\mathcal{S}| \le k$ and $(X_i)_{\mathcal{S}}$ is a submatrix composed of the columns of $X_i$
indexed by $\mathcal{S}$. In the MTL setting, we need to exploit the relations of $\rho_i^+(k)$ ($\rho_i^-(k)$) among
multiple tasks.
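
For very small problems the sparse eigenvalues of a single task can be computed by brute force, which makes Remark 7 concrete. The sketch below enumerates all supports of size exactly $k$; by eigenvalue interlacing this suffices for the supremum and infimum over $\|\mathbf{w}\|_0 \le k$. The enumeration is exponential in $d$ and is meant only as an illustration; the function name and the random data are hypothetical.

```python
import numpy as np
from itertools import combinations

def sparse_eigenvalues(X, k):
    """Return (rho_plus(k), rho_minus(k)) for one task by enumerating supports."""
    n, d = X.shape
    rho_plus, rho_minus = -np.inf, np.inf
    for support in combinations(range(d), k):
        cols = X[:, list(support)]
        eigvals = np.linalg.eigvalsh(cols.T @ cols / n)  # ascending order
        rho_plus = max(rho_plus, eigvals[-1])
        rho_minus = min(rho_minus, eigvals[0])
    return rho_plus, rho_minus

# Hypothetical data matrix: n = 30 samples, d = 6 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 6))
rho = {k: sparse_eigenvalues(X, k) for k in (1, 2, 3)}
```

For several tasks, $\rho_{\max}^+(k)$ and $\rho_{\min}^-(k)$ are then the maximum and minimum of these per-task quantities over $i \in \mathbb{N}_m$.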



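We present our parameter estimation error bound on MSMTFL in the following theorem:

Theorem 8 Let Assumption 1 hold. Define $\bar{\mathcal{F}}_i = \{(j, i) : \bar{w}_{ji} \ne 0\}$ and $\bar{\mathcal{F}} = \cup_{i \in \mathbb{N}_m} \bar{\mathcal{F}}_i$.
Denote $\bar{r}$ as the number of nonzero rows of $\bar{W}$. We assume that

$$\forall (j, i) \in \bar{\mathcal{F}}, \quad \|\bar{\mathbf{w}}^j\|_1 \ge 2\theta \tag{19}$$

and

$$\frac{\rho_i^+(s)}{\rho_i^-(2\bar{r} + 2s)} \le 1 + \frac{s}{2\bar{r}}, \tag{20}$$

where $s$ is some integer satisfying $s \ge \bar{r}$. If we choose $\lambda$ and $\theta$ such that for some $s \ge \bar{r}$:

$$\lambda \ge 12\sigma \sqrt{\frac{2 \rho_{\max}^+(1) \ln(2dm/\eta)}{n}}, \tag{21}$$

$$\theta \ge \frac{11 m \lambda}{\rho_{\min}^-(2\bar{r} + s)}, \tag{22}$$

then the following parameter estimation error bound holds with probability larger than $1 - \eta$:

$$\|\hat{W}^{(\ell)} - \bar{W}\|_{2,1} \le 0.8^{\ell/2} \frac{9.1 m \lambda \sqrt{\bar{r}}}{\rho_{\min}^-(2\bar{r} + s)} + \frac{39.5 m \sigma \sqrt{\rho_{\max}^+(\bar{r}) (7.4 \bar{r} + 2.7 \ln(2/\eta)) / n}}{\rho_{\min}^-(2\bar{r} + s)}, \tag{23}$$

where $\hat{W}^{(\ell)}$ is a solution of Eq. (3).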



Remark 9 Eq. (19) assumes that the $\ell_1$-norm of each nonzero row of $\bar{W}$ is away from zero.
This requires the true nonzero coefficients should be large enough, in order to distinguish
them from the noise. Eq. (20) is called the sparse eigenvalue condition (Zhang, 2012),
which requires the eigenvalue ratio $\rho_i^+(s) / \rho_i^-(s)$ to grow sub-linearly with respect to $s$. Such
a condition is very common in the analysis of sparse regularization (Zhang and Huang,
2008; Zhang, 2009) and it is slightly weaker than the RIP condition (Candes and Tao,
2005; Huang and Zhang, 2010; Zhang, 2012).
Remark 10 When $\ell = 1$ (corresponds to Lasso), the first term of the right-hand side of
Eq. (23) dominates the error bound in the order of

$$\|\hat{W}^{Lasso} - \bar{W}\|_{2,1} = O\left( m \sqrt{\bar{r} \ln(dm/\eta) / n} \right), \tag{24}$$

since $\lambda$ satisfies the condition in Eq. (21). Note that the first term of the right-hand side
of Eq. (23) shrinks exponentially as $\ell$ increases. When $\ell$ is sufficiently large in the order of
$O(\ln(m \sqrt{\bar{r}/n}) + \ln\ln(dm))$, this term tends to zero and we obtain the following parameter
estimation error bound:

$$\|\hat{W}^{(\ell)} - \bar{W}\|_{2,1} = O\left( m \sqrt{\bar{r}/n + \ln(1/\eta)/n} \right). \tag{25}$$

Jalali et al. (2010) gave an $\ell_{\infty,\infty}$-norm error bound $\|\hat{W}^{Dirty} - \bar{W}\|_{\infty,\infty} = O\left( \sqrt{\ln(dm/\eta)/n} \right)$
as well as a sign consistency result between $\hat{W}$ and $\bar{W}$. A direct comparison between these
two bounds is difficult due to the use of different norms. On the other hand, the worst-case
estimate of the $\ell_{2,1}$-norm error bound of the algorithm in Jalali et al. (2010) is in the same
order with Eq. (24), that is: $\|\hat{W}^{Dirty} - \bar{W}\|_{2,1} = O\left( m \sqrt{\bar{r} \ln(dm/\eta) / n} \right)$. When $dm$ is large
and the ground truth has a large number of sparse rows (i.e., $\bar{r}$ is a small constant), the
bound in Eq. (25) is significantly better than the ones for the Lasso and Dirty model.
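
The stagewise behavior described in Remark 10 can be visualized by evaluating the right-hand side of Eq. (23) for increasing $\ell$. The sketch below does this for hypothetical parameter values, chosen only to show the geometric decay of the first term toward the stage-independent second term; they are not taken from the experiments and are not checked against the conditions in Eqs. (19) to (22).

```python
import numpy as np

def error_bound_terms(stage, m, lam, r_bar, rho_min, sigma, rho_max, n, eta):
    """Return (shrinking term, stage-independent term) of Eq. (23)."""
    shrink = 0.8 ** (stage / 2.0) * 9.1 * m * lam * np.sqrt(r_bar) / rho_min
    floor = (39.5 * m * sigma
             * np.sqrt(rho_max * (7.4 * r_bar + 2.7 * np.log(2.0 / eta)) / n)
             / rho_min)
    return shrink, floor

# Hypothetical values for illustration only.
params = dict(m=3, lam=0.1, r_bar=3, rho_min=0.5, sigma=0.1,
              rho_max=2.0, n=200, eta=0.05)

stages = np.arange(1, 51)
terms = [error_bound_terms(stage, **params) for stage in stages]
shrink_terms = np.array([s for s, _ in terms])
floor = terms[0][1]
bounds = shrink_terms + floor
```

Each additional stage multiplies the first term by $\sqrt{0.8}$, so the bound approaches the second term, whose order is given in Eq. (25).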



Remark 11 Jalali et al. (2010) presented an $\ell_{\infty,\infty}$-norm parameter estimation error bound
and hence a sign consistency result can be obtained. The results are derived under the
incoherence condition which is more restrictive than the RIP condition and hence more
restrictive than the sparse eigenvalue condition in Eq. (20). From the viewpoint of the parameter [...]
Please refer to (Van De Geer and Bühlmann, 2009; Zhang, 2009, 2012) for more
details about the incoherence condition, the RIP condition, the sparse eigenvalue condition
and their relationships.



Remark 12 The capped-$\ell_1$ regularized formulation in Zhang (2010) is a special case of
our formulation when $m = 1$. However, extending the analysis from the single task to the
multi-task setting is nontrivial. Different from previous work on multi-stage sparse learning
which focuses on a single task (Zhang, 2010, 2012), we study a more general multi-stage
framework in the multi-task setting. We need to exploit the relationship among tasks, by
using the relations of sparse eigenvalues $\rho_i^+(k)$ ($\rho_i^-(k)$) and treating the $\ell_1$-norm on each
row of the weight matrix as a whole for consideration. Moreover, we simultaneously exploit
the relations of each column and each row of the matrix.



5. Proof Sketch of Theorem 8

In this section, we present a proof sketch of Theorem 8. We first provide several important
lemmas (detailed proofs are available in the Appendix) and then complete the proof of
Theorem 8 based on these lemmas.



Lemma 13 Let $\bar{\Upsilon} = [\bar{\boldsymbol{\epsilon}}_1, \cdots, \bar{\boldsymbol{\epsilon}}_m]$ with $\bar{\boldsymbol{\epsilon}}_i = \frac{1}{n} X_i^T (X_i \bar{\mathbf{w}}_i - \mathbf{y}_i)$ ($i \in \mathbb{N}_m$).
Define $\bar{\mathcal{H}} \supseteq \bar{\mathcal{F}}$ such that $(j, i) \in \bar{\mathcal{H}}$ ($\forall i \in \mathbb{N}_m$), provided there exists $(j, g) \in \bar{\mathcal{F}}$ ($\bar{\mathcal{H}}$ is a set
consisting of the indices of all entries in the nonzero rows of $\bar{W}$). Under the conditions of
Assumption 1 and the notations of Theorem 8, the followings hold with probability larger
than $1 - \eta$:

$$\|\bar{\Upsilon}\|_{\infty,\infty} \le \sigma \sqrt{\frac{2 \rho_{\max}^+(1) \ln(2dm/\eta)}{n}}, \tag{26}$$

$$\|\bar{\Upsilon}_{\bar{\mathcal{H}}}\|_F^2 \le m \sigma^2 \rho_{\max}^+(\bar{r}) (7.4 \bar{r} + 2.7 \ln(2/\eta)) / n. \tag{27}$$

Lemma 13 gives bounds on the residual correlation ($\bar{\Upsilon}$) with respect to $\bar{W}$. We note that
Eq. (26) and Eq. (27) are closely related to the assumption on $\lambda$ in Eq. (21) and the second
term of the right-hand side of Eq. (23) (error bound), respectively. This lemma provides a
fundamental basis for the proof of Theorem 8.

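Lemma 14 Use the notations of Lemma 13 and consider $\mathcal{G}_i \subseteq \mathbb{N}_d \times \{i\}$ such that
$\bar{\mathcal{F}}_i \cap \mathcal{G}_i = \emptyset$ ($i \in \mathbb{N}_m$). Let $\hat{W} = \hat{W}^{(\ell)}$ be a solution of Eq. (3) and $\Delta \hat{W} = \hat{W} - \bar{W}$. Denote
$\hat{\boldsymbol{\lambda}}_i = \hat{\boldsymbol{\lambda}}_i^{(\ell-1)} = [\hat{\lambda}_{1i}^{(\ell-1)}, \cdots, \hat{\lambda}_{di}^{(\ell-1)}]^T$. Let $\hat{\lambda}_{\mathcal{G}_i} = \min_{(j,i) \in \mathcal{G}_i} \hat{\lambda}_{ji}$, $\hat{\lambda}_{\mathcal{G}} = \min_{i \in \mathbb{N}_m} \hat{\lambda}_{\mathcal{G}_i}$ and
$\hat{\lambda}_{0i} = \max_j \hat{\lambda}_{ji}$, $\hat{\lambda}_0 = \max_i \hat{\lambda}_{0i}$. If $2 \|\bar{\boldsymbol{\epsilon}}_i\|_\infty < \hat{\lambda}_{\mathcal{G}_i}$, then the following inequality holds at any stage
$\ell \ge 1$:

$$\sum_{i=1}^{m} \sum_{(j,i) \in \mathcal{G}_i} |\hat{w}_{ji}^{(\ell)}| \le \frac{2 \|\bar{\Upsilon}\|_{\infty,\infty} + \hat{\lambda}_0}{\hat{\lambda}_{\mathcal{G}} - 2 \|\bar{\Upsilon}\|_{\infty,\infty}} \sum_{i=1}^{m} \sum_{(j,i) \in \mathcal{G}_i^c} |\Delta \hat{w}_{ji}^{(\ell)}|.$$

Denote $\mathcal{G} = \cup_{i \in \mathbb{N}_m} \mathcal{G}_i$, $\bar{\mathcal{F}} = \cup_{i \in \mathbb{N}_m} \bar{\mathcal{F}}_i$ and notice that $\bar{\mathcal{F}} \cap \mathcal{G} = \emptyset \Rightarrow \Delta \hat{W}_{\mathcal{G}}^{(\ell)} = \hat{W}_{\mathcal{G}}^{(\ell)}$.
Lemma 14 says that $\|\Delta \hat{W}_{\mathcal{G}}^{(\ell)}\|_{1,1} = \|\hat{W}_{\mathcal{G}}^{(\ell)}\|_{1,1}$ is upper bounded in terms of $\|\Delta \hat{W}_{\mathcal{G}^c}^{(\ell)}\|_{1,1}$,
which indicates that the error of the estimated coefficients locating outside of $\bar{\mathcal{F}}$ should be
small enough. This provides an intuitive explanation why the parameter estimation error
of our algorithm can be small.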

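Lemma 15 Using the notations of Lemma 14, we denote $\mathcal{G} = \mathcal{G}_{(\ell)} = \bar{\mathcal{H}}^c \cap \{(j, i) : \hat{\lambda}_{ji}^{(\ell-1)} = \lambda\}
= \cup_{i \in \mathbb{N}_m} \mathcal{G}_i$ with $\bar{\mathcal{H}}$ being defined as in Lemma 13 and $\mathcal{G}_i \subseteq \mathbb{N}_d \times \{i\}$. Let $\mathcal{J}_i$ be the
indices of the largest $s$ coefficients (in absolute value) of $\hat{\mathbf{w}}_{\mathcal{G}_i}$, $\mathcal{I}_i = \mathcal{G}_i^c \cup \mathcal{J}_i$, $\mathcal{I} = \cup_{i \in \mathbb{N}_m} \mathcal{I}_i$
and $\bar{\mathcal{F}} = \cup_{i \in \mathbb{N}_m} \bar{\mathcal{F}}_i$. Then, the following inequalities hold at any stage $\ell \ge 1$:

$$\|\Delta \hat{W}^{(\ell)}\|_{2,1} \le \frac{\left( 1 + 1.5 \sqrt{\frac{2\bar{r}}{s}} \right) \sqrt{8m \left( 4 \|\bar{\Upsilon}_{\mathcal{G}_{(\ell)}^c}\|_F^2 + \sum_{(j,i) \in \bar{\mathcal{F}}} (\hat{\lambda}_{ji}^{(\ell-1)})^2 \right)}}{\rho_{\min}^-(2\bar{r} + s)}, \tag{28}$$

$$\|\Delta \hat{W}^{(\ell)}\|_{2,1} \le \frac{9.1 m \lambda \sqrt{\bar{r}}}{\rho_{\min}^-(2\bar{r} + s)}. \tag{29}$$

Lemma 15 is established based on Lemma 14, by considering the relationship between
Eq. (21) and Eq. (26), and the specific definition of $\mathcal{G} = \mathcal{G}_{(\ell)}$. Eq. (28) provides a parameter
estimation error bound in terms of $\ell_{2,1}$-norm by $\|\bar{\Upsilon}_{\mathcal{G}_{(\ell)}^c}\|_F^2$ and the regularization parameters
$\hat{\lambda}_{ji}^{(\ell-1)}$ (see the definition of $\hat{\lambda}_{ji}$ ($\hat{\lambda}_{ji}^{(\ell-1)}$) in Lemma 14). This is the result directly used in the
proof of Theorem 8. Eq. (29) states that the error bound is upper bounded in terms of $\lambda$,
the right-hand side of which constitutes the shrinkage part of the error bound in Eq. (23).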
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Lemma 16 Let Xji = XI (||w-'||i <9,j e N^) , Vi G with some W G R'^^"'. Ti ^ F is 
defined in Lemma \13X Then under the condition of Eq. \19^) . we have: 

E E ^i<^A\Wn-Wn\\li/o^- 

Lemma [T6l establishes an upper bound of ^"ji \\^h~^h\\2,i^ which is critical 

for building the recursive relationship between — TV||2,i and HTy^^"^-* — H^||2,i in the 

proof of Theorem [HI This recursive relation is crucial for the shrinkage part of the error 
bound in Eq. 



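5.1 Proof of Theorem 8

Proof For notational simplicity, we denote the right-hand side of Eq. (27) as:

$$u = m\sigma^2\rho^{+}_{\max}(\bar{r})(7.4\bar{r} + 2.7\ln(2/\eta))/n. \qquad (30)$$

Based on $\bar{\mathcal{H}} \subseteq \mathcal{G}^c_{(\ell)}$, Lemma 13 and Eq. (21), the following hold with probability larger
than $1 - \eta$:

$$\begin{aligned}
\|\bar{\Upsilon}_{\mathcal{G}^c_{(\ell)}}\|_F^2 &= \|\bar{\Upsilon}_{\bar{\mathcal{H}}}\|_F^2 + \|\bar{\Upsilon}_{\mathcal{G}^c_{(\ell)}\setminus\bar{\mathcal{H}}}\|_F^2 \\
&\le u + |\mathcal{G}^c_{(\ell)}\setminus\bar{\mathcal{H}}|\,\|\bar{\Upsilon}\|_{\infty,\infty}^2 \\
&\le u + \lambda^2|\mathcal{G}^c_{(\ell)}\setminus\bar{\mathcal{H}}|/144 \\
&\le u + (1/144)\,m\lambda^2\theta^{-2}\|\hat{W}^{(\ell-1)}_{\mathcal{G}^c_{(\ell)}\setminus\bar{\mathcal{H}}} - \bar{W}_{\mathcal{G}^c_{(\ell)}\setminus\bar{\mathcal{H}}}\|_{2,1}^2, \qquad (31)
\end{aligned}$$

where the last inequality follows from

$$\forall (j,i) \in \mathcal{G}^c_{(\ell)}\setminus\bar{\mathcal{H}},\ \|(\hat{w}^{(\ell-1)})^j\|_1^2/\theta^2 = \|(\hat{w}^{(\ell-1)})^j - \bar{w}^j\|_1^2/\theta^2 \ge 1 \ \Rightarrow\ |\mathcal{G}^c_{(\ell)}\setminus\bar{\mathcal{H}}| \le m\theta^{-2}\|\hat{W}^{(\ell-1)}_{\mathcal{G}^c_{(\ell)}\setminus\bar{\mathcal{H}}} - \bar{W}_{\mathcal{G}^c_{(\ell)}\setminus\bar{\mathcal{H}}}\|_{2,1}^2.$$

According to Eq. (28), we have:

$$\begin{aligned}
\|\hat{W}^{(\ell)} - \bar{W}\|_{2,1}^2 &= \|\Delta\hat{W}^{(\ell)}\|_{2,1}^2 \\
&\le \frac{8m\left(1+1.5\sqrt{2\bar{r}/s}\right)^2\left(4\|\bar{\Upsilon}_{\mathcal{G}^c_{(\ell)}}\|_F^2 + \sum_{(j,i)\in\bar{\mathcal{F}}}(\hat{\lambda}^{(\ell-1)}_{ji})^2\right)}{(\rho^{-}_{\min}(2\bar{r}+s))^2} \\
&\le \frac{78m\left(4u + (37/36)\,m\lambda^2\theta^{-2}\|\hat{W}^{(\ell-1)} - \bar{W}\|_{2,1}^2\right)}{(\rho^{-}_{\min}(2\bar{r}+s))^2} \\
&\le \frac{312mu}{(\rho^{-}_{\min}(2\bar{r}+s))^2} + 0.8\,\|\hat{W}^{(\ell-1)} - \bar{W}\|_{2,1}^2 \\
&\le \cdots \le 0.8^{\ell}\,\|\hat{W}^{(0)} - \bar{W}\|_{2,1}^2 + \frac{312mu}{(\rho^{-}_{\min}(2\bar{r}+s))^2}\cdot\frac{1-0.8^{\ell}}{1-0.8} \\
&\le 0.8^{\ell}\,\frac{9.1^2m^2\lambda^2\bar{r}}{(\rho^{-}_{\min}(2\bar{r}+s))^2} + \frac{1560mu}{(\rho^{-}_{\min}(2\bar{r}+s))^2}.
\end{aligned}$$

In the above derivation, the first inequality is due to Eq. (28); the second inequality is due
to the assumption $s \ge \bar{r}$ in Theorem 8, Eq. (31) and Lemma 16; the third inequality is
due to Eq. (22); the last inequality follows from Eq. (29) and $1 - 0.8^{\ell} \le 1$ ($\ell \ge 1$). Thus,
following the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ ($\forall a, b \ge 0$), we obtain:

$$\|\hat{W}^{(\ell)} - \bar{W}\|_{2,1} \le 0.8^{\ell/2}\,\frac{9.1\,m\lambda\sqrt{\bar{r}}}{\rho^{-}_{\min}(2\bar{r}+s)} + \frac{39.5\sqrt{mu}}{\rho^{-}_{\min}(2\bar{r}+s)}.$$

Substituting Eq. (30) into the above inequality, we verify Theorem 8.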
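The numerical constants in the recursion above can be checked with a few lines of arithmetic. The sketch below is only a sanity check, not part of the proof. It assumes that Eq. (22), which is not restated in this section, has the form $\theta \ge 11m\lambda/\rho^{-}_{\min}(2\bar{r}+s)$; the variable names are ours.

```python
import math

# Worst case of (1 + 1.5 * sqrt(2 r / s)) under s >= r is s = r.
front = 8.0 * (1.0 + 1.5 * math.sqrt(2.0)) ** 2      # should not exceed 78

# Assumed form of Eq. (22): theta >= 11 * m * lam / rho_min.
# Then 78 * (37/36) * m^2 * lam^2 / (theta^2 * rho_min^2) <= 78 * (37/36) / 11^2.
contraction = 78.0 * (37.0 / 36.0) / 11.0 ** 2        # should not exceed 0.8

# Geometric series: 312 * sum_{k >= 0} 0.8^k = 312 / (1 - 0.8).
series_constant = 312.0 / (1.0 - 0.8)                 # 1560 up to rounding

# Square root of the additive constant in the final bound.
root_constant = math.sqrt(1560.0)                     # should not exceed 39.5
```

Under the stated assumption, each rounded constant in the derivation (78, 0.8, 1560 and 39.5) is an upper bound on the exact value computed here.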



Remark 17 The assumption $s \ge \bar{r}$ used in the above proof indicates that at each stage,
the number of zero entries of $\hat{W}^{(\ell)}$ should be greater than $m\bar{r}$ (see definition of $s$ in Lemma 15). This
requires that the solution obtained by Algorithm 1 at each stage is sparse, which is consistent
with the sparsity of $\bar{W}$ in Assumption 1.



6. Experiments 

In this section, we present empirical studies on both synthetic and real-world data sets. In 
the synthetic data experiments, we present the performance of the MSMTFL algorithm in 
terms of the parameter estimation error. In the real-world data experiments, we show the 
performance of the MSMTFL algorithm in terms of the prediction error. 



6.1 Competing Algorithms 

We present the empirical studies by comparing our proposed MSMTFL algorithm with three
competing multi-task feature learning algorithms: $\ell_1$-norm multi-task feature learning
algorithm (Lasso), $\ell_{1,2}$-norm multi-task feature learning algorithm (L1,2) (Obozinski et al.,
2006) and dirty model multi-task feature learning algorithm (DirtyMTL) (Jalali et al.,
2010). In our experiments, we employ the quadratic loss function in Eq. (2) for all the
compared algorithms.



6.2 Synthetic Data Experiments 

We generate synthetic data by setting the number of tasks as $m$ and each task has $n$ samples
which are of dimensionality $d$; each element of the data matrix $X_i \in \mathbb{R}^{n\times d}$ ($i \in \mathbb{N}_m$) for the
$i$-th task is sampled i.i.d. from the Gaussian distribution $N(0, 1)$ and we then normalize all
columns to length 1; each entry of the underlying true weight $\bar{W} \in \mathbb{R}^{d\times m}$ is sampled i.i.d.
from the uniform distribution in the interval $[-10, 10]$; we randomly set 90% rows of $\bar{W}$ as
zero vectors and 80% elements of the remaining nonzero entries as zeros; each entry of the
noise $\delta_i \in \mathbb{R}^{n}$ is sampled i.i.d. from the Gaussian distribution $N(0, \sigma^2)$; the responses are
computed as $y_i = X_i\bar{w}_i + \delta_i$ ($i \in \mathbb{N}_m$).
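The generation procedure described above can be sketched in NumPy as follows. This is a minimal illustration under our own assumptions: the function name `make_synthetic`, the seed handling and the rounding of the 90% and 80% counts are hypothetical choices, not details taken from the original experiments.

```python
import numpy as np


def make_synthetic(m, n, d, sigma, seed=0):
    """Return per-task design matrices, responses, and the sparse true weights."""
    rng = np.random.default_rng(seed)

    # True weights: uniform on [-10, 10], then sparsified.
    W = rng.uniform(-10.0, 10.0, size=(d, m))
    zero_rows = rng.choice(d, size=int(round(0.9 * d)), replace=False)
    W[zero_rows, :] = 0.0

    # Zero out 80% of the entries that remain in the nonzero rows.
    kept_rows = np.setdiff1d(np.arange(d), zero_rows)
    rows_idx, cols_idx = np.meshgrid(kept_rows, np.arange(m), indexing="ij")
    rows_flat, cols_flat = rows_idx.ravel(), cols_idx.ravel()
    drop = rng.choice(rows_flat.size, size=int(round(0.8 * rows_flat.size)), replace=False)
    W[rows_flat[drop], cols_flat[drop]] = 0.0

    Xs, ys = [], []
    for i in range(m):
        X = rng.standard_normal((n, d))
        X /= np.linalg.norm(X, axis=0, keepdims=True)  # unit-length columns
        delta = rng.normal(0.0, sigma, size=n)         # Gaussian noise N(0, sigma^2)
        Xs.append(X)
        ys.append(X @ W[:, i] + delta)
    return Xs, ys, W


# One of the settings used in the figures: m=15, n=40, d=250, sigma=0.01.
Xs, ys, W_true = make_synthetic(m=15, n=40, d=250, sigma=0.01)
```

Note that after the second sparsification step a retained row may happen to become entirely zero, so the number of nonzero rows is at most 10% of $d$.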

We first report the averaged parameter estimation error $\|\hat{W} - \bar{W}\|_{2,1}$ vs. Stage ($\ell$)
plots for MSMTFL (Figure 1). We observe that the error decreases as $\ell$ increases, which
shows the advantage of our proposed algorithm over Lasso. This is consistent with the
theoretical result in Theorem 8. Moreover, the parameter estimation error decreases quickly
and converges in a few stages.

We then report the averaged parameter estimation error $\|\hat{W} - \bar{W}\|_{2,1}$ in comparison with
four algorithms in different parameter settings (Figure 2). For a fair comparison, we compare
the smallest estimation errors of the four algorithms in all the parameter settings (Zhang,
2009, 2010). As expected, the parameter estimation error of the MSMTFL algorithm is the
smallest among the four algorithms. This empirical result demonstrates the effectiveness
of the MSMTFL algorithm. We also have the following observations: (a) When $\lambda$ is large
enough, all four algorithms tend to have the same parameter estimation error. This is
reasonable, because the solutions $\hat{W}$'s obtained by the four algorithms are all zero matrices,
when $\lambda$ is very large. (b) The performance of the MSMTFL algorithm is similar for different
$\theta$'s, when $\lambda$ exceeds a certain value.



6.3 Real-World Data Experiments

We conduct experiments on two real-world data sets: MRI and Isolet data sets.

The MRI data set is collected from the ADNI database, which contains 675 patients'
MRI data preprocessed using FreeSurfer (see footnote 1). The MRI data include 306 features and the
response (target) is the Mini Mental State Examination (MMSE) score coming from 6
different time points: M06, M12, M18, M24, M36, and M48. We remove the samples which
fail the MRI quality controls and have missing entries. Thus, we have 6 tasks with each
task corresponding to a time point and the sample sizes corresponding to 6 tasks are 648,
642, 293, 569, 389 and 87, respectively.

The Isolet data set (see footnote 2) is collected from 150 speakers who speak the name of each English
letter of the alphabet twice. Thus, there are 52 samples from each speaker. The speakers
are grouped into 5 subsets, each of which includes 30 similar speakers, and the subsets
are named Isolet1, Isolet2, Isolet3, Isolet4, and Isolet5. Thus, we naturally have 5 tasks
with each task corresponding to a subset. The 5 tasks respectively have 1560, 1560, 1560,
1558, and 1559 samples (see footnote 3), where each sample includes 617 features and the response is the
English letter label (1-26).



Table 1: Comparison of four multi-task feature learning algorithms on the MRI data set in
terms of averaged nMSE and aMSE (standard deviation), which are averaged over
10 random splittings.

measure  training ratio  Lasso           L1,2            DirtyMTL        MSMTFL
nMSE     0.15            0.6651(0.0280)  0.6633(0.0470)  0.6224(0.0265)  0.5539(0.0154)
         0.20            0.6254(0.0212)  0.6489(0.0275)  0.6140(0.0185)  0.5542(0.0139)
         0.25            0.6105(0.0186)  0.6577(0.0194)  0.6136(0.0180)  0.5507(0.0142)
aMSE     0.15            0.0189(0.0008)  0.0187(0.0010)  0.0172(0.0006)  0.0159(0.0004)
         0.20            0.0179(0.0006)  0.0184(0.0005)  0.0171(0.0005)  0.0161(0.0004)
         0.25            0.0172(0.0009)  0.0183(0.0006)  0.0167(0.0008)  0.0157(0.0006)



1. www.loni.ucla.edu/ADNI/

2. www.zjucadcg.cn/dengcai/Data/data.html

3. Three samples are historically missing.

[Figure 1: three panels plotting the parameter estimation error against Stage, for the settings (m=15, n=40, d=250, σ=0.01), (m=20, n=30, d=200, σ=0.005) and (m=10, n=60, d=300, σ=0.001); one curve per α ∈ {5e-005, 0.0001, 0.0002, 0.0005}.]

Figure 1: Averaged parameter estimation error $\|\hat{W} - \bar{W}\|_{2,1}$ vs. Stage ($\ell$) plots for
MSMTFL on the synthetic data set (averaged over 10 runs). Here we set
$\lambda = \alpha\sqrt{\ln(dm)/n}$, $\theta = 50m\lambda$. Note that $\ell = 1$ corresponds to Lasso; the
results show the stage-wise improvement over Lasso.
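For concreteness, the parameter values implied by the caption of Figure 1 can be tabulated as follows. The helper name `msmtfl_parameters` and the dictionary layout are ours; only the formulas $\lambda = \alpha\sqrt{\ln(dm)/n}$ and $\theta = 50m\lambda$ and the listed settings come from the caption and legend.

```python
import math


def msmtfl_parameters(alpha, m, n, d, theta_factor=50.0):
    """Return (lambda, theta) with lambda = alpha*sqrt(ln(dm)/n), theta = factor*m*lambda."""
    lam = alpha * math.sqrt(math.log(d * m) / n)
    theta = theta_factor * m * lam
    return lam, theta


settings = [(15, 40, 250), (20, 30, 200), (10, 60, 300)]   # (m, n, d) of the three panels
alphas = [5e-5, 1e-4, 2e-4, 5e-4]                          # values shown in the legend

table = {
    (m, n, d, alpha): msmtfl_parameters(alpha, m, n, d)
    for (m, n, d) in settings
    for alpha in alphas
}
```

Each entry of `table` maps one (m, n, d, α) combination to its pair (λ, θ), which makes it easy to see how strongly the threshold θ scales with the number of tasks.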



[Figure 2: three panels plotting the parameter estimation error against λ, for the settings (m=15, n=40, d=250, σ=0.01), (m=20, n=30, d=200, σ=0.005) and (m=10, n=60, d=300, σ=0.001); curves for Lasso, L1,2, DirtyMTL(1λ), DirtyMTL(0.5λ), DirtyMTL(0.2λ), DirtyMTL(0.1λ), MSMTFL(50mλ), MSMTFL(10mλ), MSMTFL(2mλ) and MSMTFL(0.4mλ).]

Figure 2: Averaged parameter estimation error $\|\hat{W} - \bar{W}\|_{2,1}$ vs. $\lambda$ plots on the synthetic
data set (averaged over 10 runs). MSMTFL has the smallest parameter estimation
error among the four algorithms. Both DirtyMTL and MSMTFL have two
parameters; we set $\lambda_s/\lambda_b = 1, 0.5, 0.2, 0.1$ for DirtyMTL ($1/m \le \lambda_s/\lambda_b \le 1$ was
adopted in Jalali et al. (2010)) and $\theta/\lambda = 50m, 10m, 2m, 0.4m$ for MSMTFL.

In the experiments, we treat the MMSE and letter labels as the regression values for the
MRI data set and the Isolet data set, respectively. For both data sets, we randomly extract
the training samples from each task with different training ratios (15%, 20% and 25%) and
use the rest of samples to form the test set. We evaluate the four multi-task feature learning
algorithms in terms of normalized mean squared error (nMSE) and averaged mean squared
error (aMSE), which are commonly used in multi-task learning problems (Zhang and Yeung,
2010; Zhou et al., 2011; Gong et al., 2012). For each training ratio, both nMSE and aMSE
are averaged over 10 random splittings of training and test sets and the standard deviation
is also shown. All parameters of the four algorithms are tuned via 3-fold cross validation.
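The splitting protocol described above can be sketched as follows. This is an illustrative sketch only: the helper `split_task`, the seed, and the rounding rule for the number of training samples are our assumptions, since the text does not specify them. The task sizes are the MRI sample sizes quoted earlier.

```python
import numpy as np


def split_task(n_samples, train_ratio, rng):
    """Randomly split sample indices of one task into train and test index arrays."""
    perm = rng.permutation(n_samples)
    n_train = int(round(train_ratio * n_samples))  # rounding rule is an assumption
    return perm[:n_train], perm[n_train:]


rng = np.random.default_rng(0)
task_sizes = [648, 642, 293, 569, 389, 87]   # MRI task sizes quoted in the text

# Ten independent random splittings at a 15% training ratio, one split per task.
splits = [[split_task(n, 0.15, rng) for n in task_sizes] for _ in range(10)]
```

Error measures would then be computed on the test indices of every task and averaged over the ten splittings, with the standard deviation reported alongside the mean.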



[Figure 3: two panels plotting nMSE and aMSE against Training Ratio (0.15, 0.2, 0.25) for the compared algorithms.]

Figure 3: Averaged test error (nMSE and aMSE) vs. training ratio plots on the Isolet data
set. The results are averaged over 10 random splittings.



Table 1 and Figure 3 show the experimental results in terms of averaged nMSE (aMSE)
and the standard deviation. From these results, we observe that: (a) Our proposed
MSMTFL algorithm outperforms all the competing feature learning algorithms on both
data sets, with the smallest regression errors (nMSE and aMSE) as well as the smallest
standard deviations. (b) On the MRI data set, the MSMTFL algorithm performs well even
in the case of a small training ratio. The performance for the 15% training ratio is comparable
to that for the 25% training ratio. (c) On the Isolet data set, when the training ratio
increases from 15% to 25%, the performance of the MSMTFL algorithm improves and the
superiority of the MSMTFL algorithm over the other three algorithms is more significant.
Our results demonstrate the effectiveness of the proposed algorithm.

7. Conclusions 

In this paper, we propose a non-convex formulation for multi-task feature learning, which 
learns the specific features of each task as well as the common features shared among 
tasks. The non-convex formulation adopts the capped-$\ell_1$,$\ell_1$ regularizer to better approximate
the $\ell_0$-type one than the commonly used convex regularizer. To solve the non-convex
optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL)
algorithm and provide intuitive interpretations from several aspects. We also present a detailed
convergence analysis and discuss the reproducibility issue for the proposed algorithm.
Specifically, we show that, under a mild condition, the solution generated by MSMTFL is
unique. Although the solution may not be globally optimal, we theoretically show that
it has good performance in terms of the parameter estimation error bound. Experimental
results on both synthetic and real-world data sets demonstrate the effectiveness of our
proposed MSMTFL algorithm in comparison with the state of the art multi-task feature
learning algorithms.

In our future work, we will explore the conditions under which a globally optimal solution 
of the proposed formulation can be obtained by the MSMTFL algorithm. We will also focus 
on a general non-convex regularization framework for multi-task learning settings (involving 
different loss functions and non-convex regularization terms) and derive theoretical bounds. 
Appendix

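In this appendix, we provide detailed proofs for Lemmas 13 to 16. In our proofs, we use
several lemmas (summarized in part B) from Zhang (2010).

We first introduce some notations used in the proof. Define

$$\pi_i(k_i, s_i) = \sup_{v\in\mathbb{R}^{k_i},\,u\in\mathbb{R}^{s_i},\,\mathcal{I}_i,\,\mathcal{J}_i}\frac{v^TA^{(i)}_{\mathcal{I}_i,\mathcal{J}_i}u\,\|v\|}{v^TA^{(i)}_{\mathcal{I}_i,\mathcal{I}_i}v\,\|u\|_\infty}, \qquad (32)$$

where $s_i + k_i \le d$ with $s_i, k_i \ge 1$; $\mathcal{I}_i$ and $\mathcal{J}_i$ are disjoint subsets of $\mathbb{N}_d$ with $k_i$ and $s_i$ elements
respectively (with some abuse of notation, we also let $\mathcal{I}_i$ be a subset of $\mathbb{N}_d \times \{i\}$, depending
on the context); $A^{(i)}_{\mathcal{I}_i,\mathcal{J}_i}$ is a sub-matrix of $A_i = n^{-1}X_i^TX_i \in \mathbb{R}^{d\times d}$ with rows indexed by $\mathcal{I}_i$
and columns indexed by $\mathcal{J}_i$.

We let $w_{\mathcal{I}_i}$ be a $d \times 1$ vector with the $j$-th entry being $w_{ji}$, if $(j, i) \in \mathcal{I}_i$, and 0, otherwise.
We also let $W_{\mathcal{I}}$ be a $d \times m$ matrix with $(j, i)$-th entry being $w_{ji}$, if $(j, i) \in \mathcal{I}$, and 0, otherwise.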

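A. Proofs of Lemmas 13 to 16

A.1 Proof of Lemma 13

Proof For the $j$-th entry of $\bar{\epsilon}_i$ ($j \in \mathbb{N}_d$):

$$|\bar{\epsilon}_{ji}| = \frac{1}{n}\left|(x^{(i)}_j)^T(X_i\bar{w}_i - y_i)\right| = \frac{1}{n}\left|(x^{(i)}_j)^T\delta_i\right|,$$

where $x^{(i)}_j$ is the $j$-th column of $X_i$. We know that the entries of $\delta_i$ are independent
sub-Gaussian random variables, and $\|(1/n)x^{(i)}_j\|^2 = \|x^{(i)}_j\|^2/n^2 \le \rho_i^{+}(1)/n$. According to
Lemma 18, we have $\forall t > 0$:

$$\Pr(|\bar{\epsilon}_{ji}| \ge t) \le 2\exp(-nt^2/(2\sigma^2\rho_i^{+}(1))) \le 2\exp(-nt^2/(2\sigma^2\rho^{+}_{\max}(1))).$$

Thus we obtain:

$$\Pr(\|\bar{\Upsilon}\|_{\infty,\infty} \le t) \ge 1 - 2dm\exp(-nt^2/(2\sigma^2\rho^{+}_{\max}(1))).$$

Let $\eta = 2dm\exp(-nt^2/(2\sigma^2\rho^{+}_{\max}(1)))$ and we can obtain Eq. (26). Eq. (27) directly follows
from Lemma 21 and the following fact:

$$\|x_i\|^2 \le ay_i \ \Rightarrow\ \|X\|_F^2 = \sum_{i=1}^{m}\|x_i\|^2 \le ma\max_{i\in\mathbb{N}_m}y_i.$$

A.2 Proof of Lemma 14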

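Proof The optimality condition of the problem solved by $\hat{W}$ in Lemma 14 implies that

$$\frac{2}{n}X_i^T(X_i\hat{w}_i - y_i) + \hat{\lambda}_i \odot \mathrm{sign}(\hat{w}_i) = 0,$$

where $\odot$ denotes the element-wise product; $\mathrm{sign}(w) = [\mathrm{sign}(w_1), \cdots, \mathrm{sign}(w_d)]^T$, where
$\mathrm{sign}(w_i) = 1$, if $w_i > 0$; $\mathrm{sign}(w_i) = -1$, if $w_i < 0$; and $\mathrm{sign}(w_i) \in [-1, 1]$, otherwise. We
note that $X_i\hat{w}_i - y_i = X_i\hat{w}_i - X_i\bar{w}_i + X_i\bar{w}_i - y_i$ and we can rewrite the above equation
into the following form:

$$2A_i\Delta\hat{w}_i = -2\bar{\epsilon}_i - \hat{\lambda}_i \odot \mathrm{sign}(\hat{w}_i).$$

Thus, for all $v \in \mathbb{R}^d$, we have

$$2v^TA_i\Delta\hat{w}_i = -2v^T\bar{\epsilon}_i - \sum_{j=1}^{d}\hat{\lambda}_{ji}v_j\,\mathrm{sign}(\hat{w}_{ji}). \qquad (33)$$

Letting $v = \Delta\hat{w}_i$ and noticing that $\Delta\hat{w}_{ji} = \hat{w}_{ji}$ for $(j, i) \notin \bar{\mathcal{F}}_i$, $i \in \mathbb{N}_m$, we obtain

$$\begin{aligned}
0 &\le 2\Delta\hat{w}_i^TA_i\Delta\hat{w}_i = -2\Delta\hat{w}_i^T\bar{\epsilon}_i - \sum_{j=1}^{d}\hat{\lambda}_{ji}\Delta\hat{w}_{ji}\,\mathrm{sign}(\hat{w}_{ji}) \\
&\le 2\|\Delta\hat{w}_i\|_1\|\bar{\epsilon}_i\|_\infty - \sum_{(j,i)\in\bar{\mathcal{F}}_i}\hat{\lambda}_{ji}\Delta\hat{w}_{ji}\,\mathrm{sign}(\hat{w}_{ji}) - \sum_{(j,i)\notin\bar{\mathcal{F}}_i}\hat{\lambda}_{ji}\Delta\hat{w}_{ji}\,\mathrm{sign}(\hat{w}_{ji}) \\
&\le 2\|\Delta\hat{w}_i\|_1\|\bar{\epsilon}_i\|_\infty + \sum_{(j,i)\in\bar{\mathcal{F}}_i}\hat{\lambda}_{ji}|\Delta\hat{w}_{ji}| - \sum_{(j,i)\notin\bar{\mathcal{F}}_i}\hat{\lambda}_{ji}|\hat{w}_{ji}| \\
&\le 2\|\Delta\hat{w}_i\|_1\|\bar{\epsilon}_i\|_\infty + \sum_{(j,i)\in\bar{\mathcal{F}}_i}\hat{\lambda}_{ji}|\Delta\hat{w}_{ji}| - \sum_{(j,i)\in\mathcal{G}_i}\hat{\lambda}_{ji}|\hat{w}_{ji}| \\
&\le 2\|\Delta\hat{w}_i\|_1\|\bar{\epsilon}_i\|_\infty + \sum_{(j,i)\in\bar{\mathcal{F}}_i}\hat{\lambda}_{0i}|\Delta\hat{w}_{ji}| - \sum_{(j,i)\in\mathcal{G}_i}\hat{\lambda}_{\mathcal{G}_i}|\hat{w}_{ji}| \\
&= \sum_{(j,i)\in\mathcal{G}_i}(2\|\bar{\epsilon}_i\|_\infty - \hat{\lambda}_{\mathcal{G}_i})|\hat{w}_{ji}| + \sum_{(j,i)\notin\bar{\mathcal{F}}_i\cup\mathcal{G}_i}2\|\bar{\epsilon}_i\|_\infty|\hat{w}_{ji}| + \sum_{(j,i)\in\bar{\mathcal{F}}_i}(2\|\bar{\epsilon}_i\|_\infty + \hat{\lambda}_{0i})|\Delta\hat{w}_{ji}|.
\end{aligned}$$

The last equality above is due to $\mathbb{N}_d \times \{i\} = \mathcal{G}_i \cup (\bar{\mathcal{F}}_i \cup \mathcal{G}_i)^c \cup \bar{\mathcal{F}}_i$ and $\Delta\hat{w}_{ji} = \hat{w}_{ji}$, $\forall (j, i) \notin \bar{\mathcal{F}}_i$
(which includes every index in $\mathcal{G}_i$). Rearranging the above inequality and noticing that $2\|\bar{\epsilon}_i\|_\infty < \hat{\lambda}_{\mathcal{G}_i} \le \hat{\lambda}_{0i}$, we obtain:

$$\begin{aligned}
\sum_{(j,i)\in\mathcal{G}_i}|\hat{w}_{ji}| &\le \frac{2\|\bar{\epsilon}_i\|_\infty}{\hat{\lambda}_{\mathcal{G}_i} - 2\|\bar{\epsilon}_i\|_\infty}\sum_{(j,i)\notin\bar{\mathcal{F}}_i\cup\mathcal{G}_i}|\hat{w}_{ji}| + \frac{2\|\bar{\epsilon}_i\|_\infty + \hat{\lambda}_{0i}}{\hat{\lambda}_{\mathcal{G}_i} - 2\|\bar{\epsilon}_i\|_\infty}\sum_{(j,i)\in\bar{\mathcal{F}}_i}|\Delta\hat{w}_{ji}| \\
&\le \frac{2\|\bar{\epsilon}_i\|_\infty + \hat{\lambda}_{0i}}{\hat{\lambda}_{\mathcal{G}_i} - 2\|\bar{\epsilon}_i\|_\infty}\|\Delta\hat{w}_{\mathcal{G}_i^c}\|_1. \qquad (34)
\end{aligned}$$

Then Lemma 14 can be obtained from the above inequality and the following two inequalities:

$$\max_{i\in\mathbb{N}_m}\frac{2\|\bar{\epsilon}_i\|_\infty + \hat{\lambda}_{0i}}{\hat{\lambda}_{\mathcal{G}_i} - 2\|\bar{\epsilon}_i\|_\infty} \le \frac{2\|\bar{\Upsilon}\|_{\infty,\infty} + \hat{\lambda}_0}{\hat{\lambda}_{\mathcal{G}} - 2\|\bar{\Upsilon}\|_{\infty,\infty}} \quad\text{and}\quad \sum_{i=1}^{m}x_iy_i \le \|x\|_\infty\|y\|_1.$$

A.3 Proof of Lemma 15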

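Proof According to the definition of $\mathcal{G}$ ($\mathcal{G}_{(\ell)}$), we know that $\bar{\mathcal{F}}_i \cap \mathcal{G}_i = \emptyset$ ($i \in \mathbb{N}_m$) and
$\forall (j, i) \in \mathcal{G}$ ($\mathcal{G}_{(\ell)}$), $\hat{\lambda}^{(\ell-1)}_{ji} = \lambda$. Thus, all conditions of Lemma 14 are satisfied, by noticing the
relationship between Eq. (21) and Eq. (26). Based on the definition of $\mathcal{G}$ ($\mathcal{G}_{(\ell)}$), we easily
obtain $\forall j \in \mathbb{N}_d$:

$$(j, i) \in \mathcal{G}_i,\ \forall i \in \mathbb{N}_m \quad\text{or}\quad (j, i) \notin \mathcal{G}_i,\ \forall i \in \mathbb{N}_m, \qquad (35)$$

and hence $k_\ell = |\mathcal{G}_1^c| = \cdots = |\mathcal{G}_m^c|$ ($k_\ell$ is some integer). Now, we assume that at stage $\ell \ge 1$:

$$k_\ell = |\mathcal{G}_1^c| = \cdots = |\mathcal{G}_m^c| \le 2\bar{r}. \qquad (36)$$

We will show in the second part of this proof that Eq. (36) holds for all $\ell$. Based on
Lemma 19 and Eq. (20), we have:

$$\begin{aligned}
\pi_i(2\bar{r} + s, s) &\le \frac{s^{1/2}}{2}\sqrt{\rho_i^{+}(s)/\rho_i^{-}(2\bar{r} + 2s) - 1} \\
&\le \frac{s^{1/2}}{2}\sqrt{1 + s/(2\bar{r}) - 1} \\
&= 0.5\,s\,(2\bar{r})^{-1/2},
\end{aligned}$$

which indicates that

$$0.5 \le t_i = 1 - \pi_i(2\bar{r} + s, s)(2\bar{r})^{1/2}s^{-1} \le 1.$$

For all $t_i \in [0.5, 1]$, under the conditions of Eq. (21) and Eq. (26), we have

$$\frac{2\|\bar{\epsilon}_i\|_\infty + \hat{\lambda}_{0i}}{\hat{\lambda}_{\mathcal{G}_i} - 2\|\bar{\epsilon}_i\|_\infty} \le \frac{2\|\bar{\Upsilon}\|_{\infty,\infty} + \lambda}{\lambda - 2\|\bar{\Upsilon}\|_{\infty,\infty}} \le \frac{7}{5} \le \frac{4 - t_i}{4 - 3t_i} \le 3.$$

Following Lemma 14, we have

$$\|\hat{W}_{\mathcal{G}}\|_{1,1} \le 3\|\Delta\hat{W}_{\mathcal{G}^c}\|_{1,1} = 3\|\Delta\hat{W} - \Delta\hat{W}_{\mathcal{G}}\|_{1,1} = 3\|\Delta\hat{W} - \hat{W}_{\mathcal{G}}\|_{1,1}.$$

Therefore

$$\begin{aligned}
\|\Delta\hat{W} - \Delta\hat{W}_{\mathcal{I}}\|_{\infty,1} &= \|\Delta\hat{W}_{\mathcal{G}} - \Delta\hat{W}_{\mathcal{J}}\|_{\infty,1} \\
&\le \|\Delta\hat{W}_{\mathcal{J}}\|_{1,1}/s = (\|\Delta\hat{W}_{\mathcal{G}}\|_{1,1} - \|\Delta\hat{W} - \Delta\hat{W}_{\mathcal{I}}\|_{1,1})/s \\
&\le s^{-1}(3\|\Delta\hat{W} - \hat{W}_{\mathcal{G}}\|_{1,1} - \|\Delta\hat{W} - \Delta\hat{W}_{\mathcal{I}}\|_{1,1}),
\end{aligned}$$

which implies that

$$\begin{aligned}
\|\Delta\hat{W}\|_{2,1} - \|\Delta\hat{W}_{\mathcal{I}}\|_{2,1} &\le \|\Delta\hat{W} - \Delta\hat{W}_{\mathcal{I}}\|_{2,1} \\
&\le (\|\Delta\hat{W} - \Delta\hat{W}_{\mathcal{I}}\|_{1,1}\|\Delta\hat{W} - \Delta\hat{W}_{\mathcal{I}}\|_{\infty,1})^{1/2} \\
&\le (\|\Delta\hat{W} - \Delta\hat{W}_{\mathcal{I}}\|_{1,1})^{1/2}\left(s^{-1}(3\|\Delta\hat{W} - \hat{W}_{\mathcal{G}}\|_{1,1} - \|\Delta\hat{W} - \Delta\hat{W}_{\mathcal{I}}\|_{1,1})\right)^{1/2} \\
&\le \left((3\|\Delta\hat{W} - \hat{W}_{\mathcal{G}}\|_{1,1}/2)^2\right)^{1/2}s^{-1/2} \\
&\le (3/2)s^{-1/2}(2\bar{r})^{1/2}\|\Delta\hat{W} - \hat{W}_{\mathcal{G}}\|_{2,1} \\
&\le (3/2)(2\bar{r}/s)^{1/2}\|\Delta\hat{W}_{\mathcal{I}}\|_{2,1}.
\end{aligned}$$

In the above derivation, the third inequality is due to $a(3b - a) \le (3b/2)^2$, and the fourth
inequality follows from Eq. (36) and $\bar{\mathcal{F}} \cap \mathcal{G} = \emptyset \Rightarrow \Delta\hat{W}_{\mathcal{G}} = \hat{W}_{\mathcal{G}}$. Rearranging the above
inequality, we obtain at stage $\ell$:

$$\|\Delta\hat{W}\|_{2,1} \le \left(1 + 1.5\sqrt{\frac{2\bar{r}}{s}}\right)\|\Delta\hat{W}_{\mathcal{I}}\|_{2,1}. \qquad (37)$$

From Lemma 20, we have:

$$\begin{aligned}
&\max(0, \Delta\hat{w}_{\mathcal{I}_i}^TA_i\Delta\hat{w}_i) \\
&\quad\ge \rho_i^{-}(k_\ell + s)\left(\|\Delta\hat{w}_{\mathcal{I}_i}\| - \pi_i(k_\ell + s, s)\|\hat{w}_{\mathcal{G}_i}\|_1/s\right)\|\Delta\hat{w}_{\mathcal{I}_i}\| \\
&\quad\ge \rho_i^{-}(k_\ell + s)\left[1 - (1 - t_i)(4 - t_i)/(4 - 3t_i)\right]\|\Delta\hat{w}_{\mathcal{I}_i}\|^2 \\
&\quad\ge 0.5\,t_i\,\rho_i^{-}(k_\ell + s)\|\Delta\hat{w}_{\mathcal{I}_i}\|^2 \\
&\quad\ge 0.25\,\rho_i^{-}(2\bar{r} + s)\|\Delta\hat{w}_{\mathcal{I}_i}\|^2 \\
&\quad\ge 0.25\,\rho^{-}_{\min}(2\bar{r} + s)\|\Delta\hat{w}_{\mathcal{I}_i}\|^2,
\end{aligned}$$

where the second inequality is due to Eq. (34), that is

$$\begin{aligned}
\|\hat{w}_{\mathcal{G}_i}\|_1 &\le \frac{2\|\bar{\epsilon}_i\|_\infty + \hat{\lambda}_{0i}}{\hat{\lambda}_{\mathcal{G}_i} - 2\|\bar{\epsilon}_i\|_\infty}\|\Delta\hat{w}_{\mathcal{G}_i^c}\|_1 \\
&\le \frac{(2\|\bar{\epsilon}_i\|_\infty + \hat{\lambda}_{0i})\sqrt{k_\ell}}{\hat{\lambda}_{\mathcal{G}_i} - 2\|\bar{\epsilon}_i\|_\infty}\|\Delta\hat{w}_{\mathcal{G}_i^c}\| \\
&\le \frac{(2\|\bar{\epsilon}_i\|_\infty + \hat{\lambda}_{0i})\sqrt{2\bar{r}}}{\hat{\lambda}_{\mathcal{G}_i} - 2\|\bar{\epsilon}_i\|_\infty}\|\Delta\hat{w}_{\mathcal{I}_i}\| \\
&\le \frac{(4 - t_i)\sqrt{2\bar{r}}}{4 - 3t_i}\|\Delta\hat{w}_{\mathcal{I}_i}\|;
\end{aligned}$$

the third inequality follows from $1 - (1 - t_i)(4 - t_i)/(4 - 3t_i) \ge 0.5t_i$ for $t_i \in [0.5, 1]$ and the
fourth inequality follows from the assumption in Eq. (36) and $t_i \ge 0.5$.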
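The two elementary facts about $t_i \in [0.5, 1]$ used above, namely $7/5 \le (4 - t_i)/(4 - 3t_i) \le 3$ and $1 - (1 - t_i)(4 - t_i)/(4 - 3t_i) \ge 0.5t_i$, can be checked on a grid. The sketch below is a numerical sanity check rather than a proof; the grid size and tolerances are arbitrary choices of ours.

```python
import numpy as np

t = np.linspace(0.5, 1.0, 1001)

# Ratio that upper-bounds (2||eps|| + lambda_0i) / (lambda_Gi - 2||eps||).
ratio = (4.0 - t) / (4.0 - 3.0 * t)

# Slack of the inequality 1 - (1 - t)(4 - t)/(4 - 3t) >= 0.5 t.
gap = 1.0 - (1.0 - t) * ratio - 0.5 * t

# Expanding the numerator shows the slack equals 0.5 t^2 / (4 - 3t), which is positive.
closed_form = 0.5 * t ** 2 / (4.0 - 3.0 * t)
```

The ratio is increasing in $t$, so its extremes on the interval are attained at the endpoints $t = 0.5$ and $t = 1$.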

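If $\Delta\hat{w}_{\mathcal{I}_i}^TA_i\Delta\hat{w}_i \le 0$, then $\|\Delta\hat{w}_{\mathcal{I}_i}\| = 0$. If $\Delta\hat{w}_{\mathcal{I}_i}^TA_i\Delta\hat{w}_i > 0$, then we have

$$\Delta\hat{w}_{\mathcal{I}_i}^TA_i\Delta\hat{w}_i \ge 0.25\,\rho^{-}_{\min}(2\bar{r} + s)\|\Delta\hat{w}_{\mathcal{I}_i}\|^2. \qquad (38)$$

By letting $v = \Delta\hat{w}_{\mathcal{I}_i}$, we obtain the following from Eq. (33):

$$\begin{aligned}
2\Delta\hat{w}_{\mathcal{I}_i}^TA_i\Delta\hat{w}_i &= -2\Delta\hat{w}_{\mathcal{I}_i}^T\bar{\epsilon}_i - \sum_{(j,i)\in\mathcal{I}_i}\hat{\lambda}_{ji}\Delta\hat{w}_{ji}\,\mathrm{sign}(\hat{w}_{ji}) \\
&= -2\Delta\hat{w}_{\mathcal{I}_i}^T\bar{\epsilon}_{\mathcal{G}_i^c} - 2\Delta\hat{w}_{\mathcal{I}_i}^T\bar{\epsilon}_{\mathcal{G}_i} - \sum_{(j,i)\in\bar{\mathcal{F}}_i}\hat{\lambda}_{ji}\Delta\hat{w}_{ji}\,\mathrm{sign}(\hat{w}_{ji}) - \sum_{(j,i)\in\mathcal{J}_i}\hat{\lambda}_{ji}|\Delta\hat{w}_{ji}| - \sum_{(j,i)\in\bar{\mathcal{F}}_i^c\cap\mathcal{G}_i^c}\hat{\lambda}_{ji}|\Delta\hat{w}_{ji}| \\
&= -2\Delta\hat{w}_{\mathcal{I}_i}^T\bar{\epsilon}_{\mathcal{G}_i^c} - 2\Delta\hat{w}_{\mathcal{J}_i}^T\bar{\epsilon}_{\mathcal{J}_i} - \sum_{(j,i)\in\bar{\mathcal{F}}_i}\hat{\lambda}_{ji}\Delta\hat{w}_{ji}\,\mathrm{sign}(\hat{w}_{ji}) - \sum_{(j,i)\in\mathcal{J}_i}\hat{\lambda}_{ji}|\Delta\hat{w}_{ji}| - \sum_{(j,i)\in\bar{\mathcal{F}}_i^c\cap\mathcal{G}_i^c}\hat{\lambda}_{ji}|\Delta\hat{w}_{ji}| \\
&\le 2\|\Delta\hat{w}_{\mathcal{I}_i}\|\|\bar{\epsilon}_{\mathcal{G}_i^c}\| + 2\|\bar{\epsilon}_{\mathcal{J}_i}\|_\infty\sum_{(j,i)\in\mathcal{J}_i}|\Delta\hat{w}_{ji}| + \sum_{(j,i)\in\bar{\mathcal{F}}_i}\hat{\lambda}_{ji}|\Delta\hat{w}_{ji}| - \sum_{(j,i)\in\mathcal{J}_i}\hat{\lambda}_{ji}|\Delta\hat{w}_{ji}| \\
&\le 2\|\Delta\hat{w}_{\mathcal{I}_i}\|\|\bar{\epsilon}_{\mathcal{G}_i^c}\| + \Big(\sum_{(j,i)\in\bar{\mathcal{F}}_i}\hat{\lambda}_{ji}^2\Big)^{1/2}\|\Delta\hat{w}_{\bar{\mathcal{F}}_i}\| \\
&\le 2\|\Delta\hat{w}_{\mathcal{I}_i}\|\|\bar{\epsilon}_{\mathcal{G}_i^c}\| + \Big(\sum_{(j,i)\in\bar{\mathcal{F}}_i}\hat{\lambda}_{ji}^2\Big)^{1/2}\|\Delta\hat{w}_{\mathcal{I}_i}\|. \qquad (39)
\end{aligned}$$

In the above derivation, the second equality is due to $\mathcal{I}_i = \mathcal{J}_i \cup \bar{\mathcal{F}}_i \cup (\bar{\mathcal{F}}_i^c \cap \mathcal{G}_i^c)$; the third
equality is due to $\mathcal{I}_i \cap \mathcal{G}_i = \mathcal{J}_i$; the second inequality follows from $\forall (j, i) \in \mathcal{J}_i$, $\hat{\lambda}_{ji} = \lambda \ge
2\|\bar{\epsilon}_i\|_\infty \ge 2\|\bar{\epsilon}_{\mathcal{J}_i}\|_\infty$ and the last inequality follows from $\bar{\mathcal{F}}_i \subseteq \mathcal{G}_i^c \subseteq \mathcal{I}_i$. Combining Eq. (38)
and Eq. (39), we have

$$\|\Delta\hat{w}_{\mathcal{I}_i}\| \le \frac{2}{\rho^{-}_{\min}(2\bar{r} + s)}\left[2\|\bar{\epsilon}_{\mathcal{G}_i^c}\| + \Big(\sum_{(j,i)\in\bar{\mathcal{F}}_i}\hat{\lambda}_{ji}^2\Big)^{1/2}\right].$$

Notice that

$$\|x_i\| \le a(\|y_i\| + \|z_i\|) \ \Rightarrow\ \|X\|_{2,1}^2 \le m\|X\|_F^2 = m\sum_{i}\|x_i\|^2 \le 2ma^2(\|Y\|_F^2 + \|Z\|_F^2).$$

Thus, we have

$$\|\Delta\hat{W}_{\mathcal{I}}\|_{2,1} \le \frac{\sqrt{8m\left(4\|\bar{\Upsilon}_{\mathcal{G}^c_{(\ell)}}\|_F^2 + \sum_{(j,i)\in\bar{\mathcal{F}}}(\hat{\lambda}^{(\ell-1)}_{ji})^2\right)}}{\rho^{-}_{\min}(2\bar{r} + s)}. \qquad (40)$$

Therefore, at stage $\ell$, Eq. (28) in Lemma 15 directly follows from Eq. (37) and Eq. (40).
Following Eq. (28), we have:

$$\begin{aligned}
\|\hat{W}^{(\ell)} - \bar{W}\|_{2,1} &= \|\Delta\hat{W}^{(\ell)}\|_{2,1} \\
&\le \frac{\left(1 + 1.5\sqrt{\frac{2\bar{r}}{s}}\right)\sqrt{8m\left(4\|\bar{\Upsilon}_{\mathcal{G}^c_{(\ell)}}\|_F^2 + \sum_{(j,i)\in\bar{\mathcal{F}}}(\hat{\lambda}^{(\ell-1)}_{ji})^2\right)}}{\rho^{-}_{\min}(2\bar{r} + s)} \\
&\le \frac{8.83\sqrt{m}\sqrt{4\|\bar{\Upsilon}\|_{\infty,\infty}^2|\mathcal{G}^c_{(\ell)}| + \bar{r}m\lambda^2}}{\rho^{-}_{\min}(2\bar{r} + s)} \\
&\le \frac{9.1\,m\lambda\sqrt{\bar{r}}}{\rho^{-}_{\min}(2\bar{r} + s)},
\end{aligned}$$

where the first inequality is due to Eq. (28); the second inequality is due to $s \ge \bar{r}$ (assumption
in Theorem 8), $\hat{\lambda}_{ji} \le \lambda$, $\bar{r}m = |\bar{\mathcal{H}}| \ge |\bar{\mathcal{F}}|$ and the third inequality follows from Eq. (36)
and $\|\bar{\Upsilon}\|_{\infty,\infty}^2 \le (1/144)\lambda^2$. Therefore, Eq. (29) in Lemma 15 holds at stage $\ell$.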
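The constants 8.83 and 9.1 in the last chain can be reproduced with a short calculation. The sketch below assumes, as in the derivation, that $s \ge \bar{r}$, that $|\mathcal{G}^c_{(\ell)}| \le 2\bar{r}m$ (Eq. (36) applied to each of the $m$ tasks) and that $\|\bar{\Upsilon}\|^2_{\infty,\infty} \le \lambda^2/144$; the variable names are ours.

```python
import math

# (1 + 1.5 * sqrt(2 r / s)) * sqrt(8) is largest at s = r.
front = (1.0 + 1.5 * math.sqrt(2.0)) * math.sqrt(8.0)   # about 8.83

# 4 * ||Upsilon||^2 * |G^c| + r*m*lam^2
#   <= (4 * (1/144) * 2 + 1) * r*m*lam^2 = (19/18) * r*m*lam^2.
inside = 4.0 * (1.0 / 144.0) * 2.0 + 1.0

# Coefficient of m * lam * sqrt(r) after taking the square root.
final = 8.83 * math.sqrt(inside)                          # should not exceed 9.1
```

Both rounded constants are therefore valid upper bounds on the exact values under the stated assumptions.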

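Notice that we obtain Lemma 15 at stage $\ell$, by assuming that Eq. (36) is satisfied. To
prove that Lemma 15 holds for all stages, we next need to prove by induction that Eq. (36)
holds at all stages.

When $\ell = 1$, we have $\mathcal{G}^c_{(1)} = \bar{\mathcal{H}}$, which implies that Eq. (36) holds. Now, we assume
that Eq. (36) holds at stage $\ell$. Thus, by the induction hypothesis, we have:

$$\begin{aligned}
\sqrt{|\mathcal{G}^c_{(\ell+1)}\setminus\bar{\mathcal{H}}|} &\le \sqrt{m}\,\theta^{-1}\|\hat{W}^{(\ell)}_{\mathcal{G}^c_{(\ell+1)}\setminus\bar{\mathcal{H}}} - \bar{W}_{\mathcal{G}^c_{(\ell+1)}\setminus\bar{\mathcal{H}}}\|_{2,1} \\
&\le \sqrt{m}\,\theta^{-1}\|\hat{W}^{(\ell)} - \bar{W}\|_{2,1} \\
&\le \frac{9.1\,m\lambda\sqrt{\bar{r}}}{\rho^{-}_{\min}(2\bar{r} + s)}\sqrt{m}\,\theta^{-1} \\
&\le \sqrt{\bar{r}m},
\end{aligned}$$

where $\theta$ is the thresholding parameter in Eq. (1); the first inequality above follows from the
definition of $\mathcal{G}_{(\ell)}$ in Lemma 15:

$$\forall (j, i) \in \mathcal{G}^c_{(\ell+1)}\setminus\bar{\mathcal{H}},\ \|(\hat{w}^{(\ell)})^j\|_1^2/\theta^2 = \|(\hat{w}^{(\ell)})^j - \bar{w}^j\|_1^2/\theta^2 \ge 1;$$

the last inequality is due to Eq. (22). Thus, we have:

$$|\mathcal{G}^c_{(\ell+1)}\setminus\bar{\mathcal{H}}| \le \bar{r}m \ \Rightarrow\ |\mathcal{G}^c_{(\ell+1)}| \le 2\bar{r}m \ \Rightarrow\ k_{\ell+1} \le 2\bar{r}.$$

Therefore, Eq. (36) holds at all stages. Thus the two inequalities in Lemma 15 hold at all
stages. This completes the proof of the lemma. ■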






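A.4 Proof of Lemma 16

Proof The first inequality directly follows from $\bar{\mathcal{H}} \supseteq \bar{\mathcal{F}}$. Next, we focus on the second
inequality. For each $(j, i) \in \bar{\mathcal{F}}$ ($\bar{\mathcal{H}}$), if $\|\hat{w}^j\|_1 < \theta$, by considering Eq. (19), we have

$$\|\bar{w}^j - \hat{w}^j\|_1 \ge \|\bar{w}^j\|_1 - \|\hat{w}^j\|_1 \ge 2\theta - \theta = \theta.$$

Therefore, we have for each $(j, i) \in \bar{\mathcal{F}}$ ($\bar{\mathcal{H}}$):

$$I(\|\hat{w}^j\|_1 < \theta) \le \|\bar{w}^j - \hat{w}^j\|_1/\theta.$$

Thus, the second inequality of Lemma 16 directly follows from the above inequality. ■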
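The indicator argument above can be illustrated on a small hypothetical example. The sketch below builds a true weight matrix whose nonzero rows satisfy $\|\bar{w}^j\|_1 \ge 2\theta$, forms an arbitrary estimate, and checks both the row-wise indicator inequality and the resulting bound of Lemma 16. All sizes and values are made up for illustration, and the code assumes the convention $\|A\|_{2,1}^2 = \sum_j \|a^j\|_1^2$ over rows $a^j$, which is our reading of how the norm is used in these proofs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n_true_rows = 30, 4, 5
theta, lam = 1.0, 0.3

# True weights: only the first rows are nonzero, each with l1-norm >= 2*theta,
# which is the condition of Eq. (19).
W_bar = np.zeros((d, m))
signs = rng.choice([-1.0, 1.0], size=(n_true_rows, m))
W_bar[:n_true_rows] = signs * rng.uniform(1.0, 3.0, size=(n_true_rows, m))
H_rows = np.arange(n_true_rows)          # rows forming H-bar

# A hypothetical estimate: small noise everywhere, two true rows shrunk to ~0.
W_hat = W_bar + 0.05 * rng.standard_normal((d, m))
W_hat[[0, 2]] = 0.01 * W_bar[[0, 2]]

row_l1_hat = np.abs(W_hat).sum(axis=1)
indicator = (row_l1_hat < theta).astype(float)   # I(||w_hat^j||_1 < theta)
lam_hat = lam * indicator                        # same weight for every task i in row j
diff_l1 = np.abs(W_bar - W_hat).sum(axis=1)      # ||w_bar^j - w_hat^j||_1

# Each row of H-bar contributes m entries (j, i) with the same lam_hat[j].
lhs = m * np.sum(lam_hat[H_rows] ** 2)
# Assumed convention: ||A||_{2,1}^2 = sum_j ||a^j||_1^2 over rows a^j.
rhs = m * lam ** 2 * np.sum(diff_l1[H_rows] ** 2) / theta ** 2
```

Rows of the true support that the estimate shrinks below the threshold are exactly the ones that keep a nonzero weight, and each of them contributes at least $\theta$ to the row-wise estimation error, which is why `lhs` cannot exceed `rhs`.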



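B. Lemmas from Zhang (2010)

Lemma 18 Let $a \in \mathbb{R}^n$ be a fixed vector and $x \in \mathbb{R}^n$ be a random vector which is composed
of independent sub-Gaussian components with parameter $\sigma$. Then we have:

$$\Pr(|a^Tx| \ge t) \le 2\exp\left(-t^2/(2\sigma^2\|a\|^2)\right),\ \forall t > 0.$$

Lemma 19 The following inequality holds:

$$\pi_i(k_i, s_i) \le \frac{s_i^{1/2}}{2}\sqrt{\rho_i^{+}(s_i)/\rho_i^{-}(k_i + s_i) - 1}.$$

Lemma 20 Let $\mathcal{G}_i \subseteq \mathbb{N}_d \times \{i\}$ such that $|\mathcal{G}_i^c| = k_i$, and let $\mathcal{J}_i$ be indices of the $s_i$ largest
components (in absolute values) of $w_{\mathcal{G}_i}$ and $\mathcal{I}_i = \mathcal{G}_i^c \cup \mathcal{J}_i$. Then for any $w_i \in \mathbb{R}^d$, we have

$$\max(0, w_{\mathcal{I}_i}^TA_iw_i) \ge \rho_i^{-}(k_i + s_i)\left(\|w_{\mathcal{I}_i}\| - \pi_i(k_i + s_i, s_i)\|w_{\mathcal{G}_i}\|_1/s_i\right)\|w_{\mathcal{I}_i}\|.$$

Lemma 21 Let $\bar{\epsilon}_i = [\bar{\epsilon}_{1i}, \cdots, \bar{\epsilon}_{di}]^T = \frac{1}{n}X_i^T(X_i\bar{w}_i - y_i)$ ($i \in \mathbb{N}_m$), and $\bar{\mathcal{H}}_i \subseteq \mathbb{N}_d \times \{i\}$. Under
the conditions of Assumption 1, the following hold with probability larger than $1 - \eta$:

$$\|\bar{\epsilon}_{\bar{\mathcal{H}}_i}\|^2 \le \sigma^2\rho_i^{+}(|\bar{\mathcal{H}}_i|)(7.4|\bar{\mathcal{H}}_i| + 2.7\ln(2/\eta))/n.$$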